* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  [not found] ` <1x0qG-Dr-3@gated-at.bofh.it>
@ 2004-03-12 21:15 ` Andi Kleen
  2004-03-18 19:50   ` Peter Zaitsev
  0 siblings, 1 reply; 100+ messages in thread
From: Andi Kleen @ 2004-03-12 21:15 UTC (permalink / raw)
To: Peter Zaitsev; +Cc: linux-kernel

Peter Zaitsev <peter@mysql.com> writes:
>
> Rather than changing design how time is computed I think we would better
> to go to better accuracy - nowadays 1 second is far too raw.

Just call gettimeofday(). In nearly all kernels time() internally does
that anyway.

-Andi

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-12 21:15 ` 2.4.23aa2 (bugfixes and important VM improvements for the high end) Andi Kleen
@ 2004-03-18 19:50 ` Peter Zaitsev
  0 siblings, 0 replies; 100+ messages in thread
From: Peter Zaitsev @ 2004-03-18 19:50 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel

On Fri, 2004-03-12 at 13:15, Andi Kleen wrote:
> Peter Zaitsev <peter@mysql.com> writes:
> >
> > Rather than changing design how time is computed I think we would better
> > to go to better accuracy - nowadays 1 second is far too raw.
>
> Just call gettimeofday(). In near all kernels time internally does that
> anyways.

Right. gettimeofday() was much slower some years ago on some other Unix
platform, which is why time() was used instead. Now we just need to fix
a lot of places (datatypes, prints, etc.) to move to gettimeofday().

--
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando, FL)
http://www.mysql.com/uc2004/
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  [not found] ` <1x293-2nT-7@gated-at.bofh.it>
@ 2004-03-12 21:25 ` Andi Kleen
  0 siblings, 0 replies; 100+ messages in thread
From: Andi Kleen @ 2004-03-12 21:25 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, andrea

Ingo Molnar <mingo@elte.hu> writes:

> but i'm quite strongly convinced that 'getting rid' of the 'pte chain
> overhead' in favor of questionable lowmem space gains for a dying
> (high-end server) platform is very shortsighted. [getting rid of them
> for purposes of the 64-bit platforms could be OK, but the argumentation
> isnt that strong there i think.]

pte chain locking still shows up quite high in profile logs of 2.6 on
x86-64 for common workloads. It's nonexistent in mainline 2.4. I would
consider this a strong reason to do something about it.

-Andi
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  [not found] <20040304175821.GO4922@dualathlon.random>
@ 2004-03-04 22:14 ` Rik van Riel
  2004-03-04 23:24   ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Rik van Riel @ 2004-03-04 22:14 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Thu, 4 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 04, 2004 at 07:12:23AM -0500, Rik van Riel wrote:

> > All the CPUs use the _same_ mm_struct in kernel space, so
> > all VM operations inside the kernel are effectively single
> > threaded.
>
> so what, the 3:1 has the same bottleneck too.

Not true. In the 3:1 split every process has its own mm_struct, and
they all happen to share the top GB with kernel stuff. You can do a
copy_to_user on multiple CPUs efficiently.

> or maybe you mean the page_table_lock hold during copy-user that Andrew
> mentioned? (copy-user doesn't mean "all VM operations" not sure if you
> meant this or the usual locking of every 2.4/2.6 kernel out there)

True, there are some other operations. However, when you consider the
fact that copy-user operations are needed for so many things, they are
the big bottleneck. Making it possible to copy things to and from
userspace in a lockless way will help performance quite a bit...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 22:14 ` Rik van Riel
@ 2004-03-04 23:24 ` Andrea Arcangeli
  2004-03-05  3:43   ` Rik van Riel
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-03-04 23:24 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Thu, Mar 04, 2004 at 05:14:30PM -0500, Rik van Riel wrote:
> > or maybe you mean the page_table_lock hold during copy-user that Andrew
> > mentioned? (copy-user doesn't mean "all VM operations" not sure if you
> > meant this or the usual locking of every 2.4/2.6 kernel out there)
>
> True, there are some other operations. However, when

Could you name one that is serialized in 4:4 and not in 3:1 with an mm
lock? Just curious. There are tons of VM operations serialized by the
page_table_lock that hurt with threads in 3:1 too. I understood only
copy-user needs the additional locking.

> you consider the fact that copy-user operations are
> needed for so many things they are the big bottleneck.
>
> Making it possible to copy things to and from userspace
> in a lockless way will help performance quite a bit...

I don't expect a huge speedup, but it would certainly be measurable.
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-03-04 23:24 ` Andrea Arcangeli
@ 2004-03-05  3:43 ` Rik van Riel
  0 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2004-03-05 3:43 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel

On Fri, 5 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 04, 2004 at 05:14:30PM -0500, Rik van Riel wrote:
> > > or maybe you mean the page_table_lock hold during copy-user that Andrew
> > > mentioned? (copy-user doesn't mean "all VM operations" not sure if you
> > > meant this or the usual locking of every 2.4/2.6 kernel out there)
> >
> > True, there are some other operations. However, when
>
> could you name one that is serialized in 4:4 and not in 3:1 with an mm
> lock? just curious. there are tons of VM operations serialized by the
> page_table_lock that hurts with threads in 3:1 too. I understood only
> copy-user needs the additional locking.

Yeah, in the case of a threaded workload you're right. For a
many-processes workload the locking optimisations definitely made a
difference, IIRC.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* 2.4.23aa2 (bugfixes and important VM improvements for the high end)
@ 2004-02-27 1:33 Andrea Arcangeli
2004-02-27 4:38 ` Rik van Riel
0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 1:33 UTC (permalink / raw)
To: linux-kernel
This includes some relevant fixes for 2.4; if I missed something important
please let me know. I'm not following 2.4 mainline anymore, since it's a
lot of work and I'm trying to ship a 2.6-aa soon.
The most interesting part of this update is a VM improvement for the
high end machines: it makes it possible to swap several gigs
efficiently while doing I/O on the very high end (>=32G) machines.
This is for all archs (it's unrelated to x86 or the zone normal).
See below for details.
URL:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2.gz
Diff between 2.4.23aa1 and 2.4.23aa2:
Only in 2.4.23aa2: 00_blkdev-eof-1
Allow reading the last block (but not writing it).
Only in 2.4.23aa1: 00_csum-trail-1
Obsoleted (can't trigger anymore).
Only in 2.4.23aa2: 00_elf-interp-check-arch-1
Make sure interpreter is of the right architecture.
Only in 2.4.23aa1: 00_extraversion-33
Only in 2.4.23aa2: 00_extraversion-34
Rediff.
Only in 2.4.23aa2: 00_mremap-1
Various mremap fixes.
Only in 2.4.23aa2: 00_ncpfs-1
Limit dentry name length (from Andi).
Only in 2.4.23aa2: 00_rawio-crash-1
Handle partial get_user_pages correctly.
Only in 2.4.23aa2: 05_vm_28-shmem-big-iron-1
Make it possible to swap huge amounts of shm efficiently. The
stock 2.4 VM algorithms aren't capable of dealing with huge amounts
of shm on huge machines. This has been a showstopper in production on
32G boxes swapping regularly (the shm in this case wasn't pure fs cache
like in oracle, so it really had to be swapped out).
There are three basic problems that interact in non-obvious ways, all
three fixed in this patch.
1) This one is well known and in fact it's already fixed in 2.6 mainline;
too bad the way it's fixed in 2.6 mainline makes 2.6 unusable at all
in a workload like the one running on these 32G high end machines, and
2.6 will now have to be changed because of that. Anyway, returning to
2.4: the swap_out loop with pagetable walking doesn't scale for huge
shared memory. This is obvious. With half a million pages mapped
some hundred times in the address space, when we have to swap shm, before
we can writepage a single shm dirty page with page_count(page) == 1,
we first have to walk and destroy the whole address space. It doesn't
just unmap 1 single page, it unmaps everything else too. All the
address spaces in the machine are destroyed in order to make single
shared shm pages freeable, and that generates a constant flood of
expensive minor page faults. The way to fix it without any downside
(except purely theoretical ones exposed by Andrew last year - and you
can maliciously waste the same cpu using truncate) is a mix between
objrmap (note: objrmap has nothing to do with the rmap as in 2.6) and the
pagetable walking code. In short, during the pagetable walking I check
if a page is freeable, and if it isn't yet, I execute the objrmap on it to
make it immediately freeable if the trylocking permits. This allows
swap_out to make progress and to unmap one shared page at a time, instead
of unmapping all of them from all address spaces before the first one
becomes freeable/swappable. Some lib functions in the patch are taken
from the objrmap patch for 2.6 in the mbligh tree implemented by IBM
(thanks to Martin and IBM for maintaining that patch up to date for 2.6;
that is a must-have starting point for the 2.6 VM too). The original
idea of using objrmap for the vm unmapping procedure is from David
Miller (objrmap itself has always existed in every linux kernel out
there, to provide mmap coherency through vmtruncate, and now
it is also used from the 2.4-aa swap_out pagetable walking).
To give an idea, the top profiled function while swapping 4G of shm on
the 32G box now is try_to_unmap_shared_vma.
2) The writepage callback of the shared memory converts a shm page
into a swapcache page, but it doesn't start the I/O on the swapcache
immediately. This is nonsense and it's easy to fix, by simply calling
the swapcache writepage within shm_writepage before returning
(then of course we must not unlock the page anymore before returning;
the I/O completion will unlock it).
By not starting the I/O, we'll only start it after another million
pages, and after starting the I/O we'll only notice the finally-free
page after another million pages pass.
A super swap storm was happening because of these issues;
especially the interaction between point 1 and point 2 was detrimental
(by 1 and 2 I mean the phase from unmapping the page to starting the
I/O; the last phase, from starting the I/O to effectively noticing a
freeable clean swapcache page, is addressed in point 3 below).
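The fix in point 2 amounts to chaining the two writepage steps. A sketch in pseudocode - the helper names below are illustrative, not the actual 2.4 functions:

```
/* Before: shm's writepage converted the page to swapcache and
 * returned, leaving the actual swapout to a later lru pass that
 * first has to churn through a million other pages. */
shm_writepage(page):
    entry = get_swap_page()
    move_from_page_cache_to_swap_cache(page, entry)
    unlock_page(page)             /* no I/O started here */
    return

/* After: start the swapout immediately from within shm_writepage;
 * the page stays locked and the I/O completion unlocks it. */
shm_writepage(page):
    entry = get_swap_page()
    move_from_page_cache_to_swap_cache(page, entry)
    return swapcache_writepage(page)
```

The change is essentially the "trivial two liner" referred to later in the thread: same conversion to swapcache, but the I/O submission and the unlock-on-completion move inside the callback.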
In short, if you had to swap 4k of shm, the kernel would immediately move
more than 4G (G is not a typo ;) of shm into swapcache, and then start
the I/O to swap all those 4G out, because it was all dirty and
freeable cache (from the shmfs fs) queued contiguously in the lru. So
after the first 4k of shm to be swapped out, every further allocation
would generate a swapout for a very, very long time. Reading a file
would involve swapping out the amount of data read as well, not to mention
that reading into empty vmas would generate two times more
swapping than the I/O itself. So machines that were swapping slightly
(around 4G on a 32G box) were entering a swap storm mode with very huge
stalls lasting several dozen minutes (you can imagine, if every memory
allocation was executing I/O before returning).
Here is a trace of that scenario, as soon as the first 35M of address
space become freeable and are moved into swapcache. This is the point
where basically all shm is already freeable but dirty; you can see
the machine hangs with 100% system load (8 cpus doing nothing but
throwing away address space with the background
swap_out, and calling shm_writepage until all shm is converted
to swapcache):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
33 1 35156 507544 80232 13523880 0 116 0 772 223 53603 24 76 0 0
29 3 78944 505116 72184 13342024 0 124 0 2092 153 52579 27 73 0 0
29 0 114916 505612 70576 13174984 0 0 4 10556 415 26109 31 69 0 0
29 0 147924 506796 70588 13150352 0 0 12 7360 717 3944 32 68 0 0
25 1 108764 507192 68968 12942260 0 0 8 10989 1111 4886 19 81 0 0
27 0 266504 505104 64964 12717276 0 0 3 3122 223 2598 56 44 0 0
24 0 418204 505184 29956 12550032 0 4 1 3426 224 5242 56 44 0 0
16 0 613844 505084 29916 12361204 0 0 4 80 135 4522 23 77 0 0
20 2 781416 505048 29916 12197904 0 0 0 0 111 10961 13 87 0 0
22 0 1521712 505020 29880 11482256 0 0 40 839 1208 4182 13 87 0 0
24 0 1629888 505108 29852 11374864 0 0 0 24 377 537 13 87 0 0
27 0 1743756 505052 29852 11261732 0 0 0 0 364 598 11 89 0 0
25 0 1870012 505136 29900 11135420 0 0 4 254 253 491 7 93 0 0
24 0 2024436 505160 29900 10981496 0 0 0 24 125 484 10 90 0 0
25 0 2287968 505280 29840 10718172 0 0 0 0 116 2603 11 89 0 0
23 0 2436032 505316 29840 10570856 0 0 0 0 122 418 3 97 0 0
25 0 2536336 505380 29840 10470516 0 0 0 24 115 389 2 98 0 0
26 0 2691544 505564 29792 10316112 0 0 0 0 125 443 8 92 0 0
27 0 2847872 505508 29836 10159752 0 0 0 226 138 482 7 93 0 0
24 0 2985480 505380 29836 10022836 0 0 0 524 146 491 5 95 0 0
24 0 3123600 505048 29792 9885668 0 0 0 149 112 397 2 98 0 0
27 0 3274800 505116 29792 9734396 0 0 0 0 112 377 5 95 0 0
28 0 3415516 505156 29792 9593624 0 0 0 24 109 1822 8 92 0 0
26 0 3551020 505272 29700 9458072 0 0 0 0 113 453 7 93 0 0
26 0 3682576 505052 29744 9326488 0 0 0 462 140 461 5 95 0 0
26 0 3807984 505016 29744 9200960 0 0 0 24 125 458 3 97 0 0
27 0 3924152 505020 29700 9085040 0 0 0 0 121 514 3 97 0 0
28 0 4109196 505072 29700 8900812 0 0 0 0 120 417 6 94 0 0
23 2 4451156 505284 29596 8558780 0 0 4 24 122 491 9 91 0 0
23 0 4648772 505044 29600 8361080 0 0 4 0 124 544 5 95 0 0
26 2 4821844 505068 29572 8187904 0 0 0 314 132 492 3 97 0 0
After all the address space has been destroyed multiple times
and all shm has been converted into swapcache, the kernel can
finally swap out the first 4 kbytes of shm:
6 23 4560356 506160 22188 8440104 0 64888 16 67280 3006 534109 6 88 5 0
The rest is infinite swapping and a total machine hang, since those 4.8G
of swapcache are now freeable and dirty, and the kernel will not
notice the freeable and clean swapcache generated by this 64M
swapout, since it's queued at the opposite side of the lru
and we'll find another million pages to swap (i.e. writepage)
before noticing it. So the kernel will effectively start
generating the first 4k of free memory only after those half
million pages have been swapped out.
So points 1 and 2 fix the kernel so that we unmap only 64M of shm and
then swap it out immediately, but we can still run into huge problems
from having half a million pages of shm queued in a row in the lru.
3) After fixing 1 and 2, a file copy was still not running (well, it was
running, but some 10 or 100 times slower than it should have, preventing
any backup activity from working etc.) once the machine was 4G into
swap. But at least the machine was swapping just fine, so the
database and the application itself were doing pretty well (no swap
storms anymore after fixing 1 and 2); it's the other apps allocating
further cache that didn't work yet.
So I tried reading a huge multi-gig file. I left the cp
running for a dozen minutes at a terribly slow rate (even though the
data was on the SAN), and I noticed this cp was pushing the
database another few gigs further into swap, precisely as much as
the size of the file. During the read, the swapout greatly
exceeded the reads from the disk (so vs. bi in vmstat). In short,
the cache allocated on the huge file was behaving like a
memory leak. But it was not a memory leak; it was perfectly fine
clean and freeable cache reachable from the lru. Too bad we
still had a hundred thousand shm pages being unmapped (by the
background objrmap passes) to walk and to shm->writepage and
swapcache->writepage before noticing the immediately freeable
gigs of clean cache of the file.
In short, the system was doing fine, but due to the ordering of the lru
it was preferring to swap gigs and gigs of shm, pushing the running db
into swap, instead of shrinking the several unused (and absolutely
not aged) gigs of clean cache. This wasn't good for
efficient swapping either, because it was taking a long time before the
vm could notice the clean swapcache after it started the I/O on
it.
It was pretty clear after that that we have to prioritize and
prefer discarding memory that costs zero to collect, rather than doing
extremely expensive things to release memory instead. This
change isn't absolutely fair, but the current behaviour of the
vm is an order of magnitude worse in the high end. So the fix I
implemented is to run an inactive_list/vm_cache_scan_ratio pass
over the clean, immediately freeable cache in the inactive list
before going into the ->writepage business, and voilà, copies
were running back at 80M/sec, as if no 4G were being swapped
out at the same time. Since swap, data and binaries are all in
different places (data on the SAN, swap on a local IDE, binaries
on another local IDE), swapping while copying didn't hurt too
much either.
A better fix would be to have an anchor in the lru (it could be a per-lru
page_t with PG_anchor set) and to avoid having the clean-cache search
alter the point where we keep swapping with writepage, but it
shouldn't matter that much, and with 2.4 being obsolete it isn't very
worthwhile to make it even better.
On the low end (<8G) these effects that hang machines for hours and make
any I/O impossible weren't visible, because the lru is more mixed and there
are never too many shm pages in a row. The less ram, the less visible this
effect is. However, even on small machines swapping shm should now be
more efficient. The downside is that pure clean
cache is now penalized against dirty cache, but I believe it's worth
paying for this downside to survive and handle the very high end
workloads.
I didn't find making these changes in 2.4 at this time very attractive,
but 2.6 has no way to run these workloads, and no 2.6 kernel is
certified for some products yet, etc. This code is now running in
production and people seem happy. I think this is the first
linux ever with a chance of handling this high end
workload properly on high end hardware, so if you had problems give this
a spin. 2.6 obviously will lock up immediately in these workloads
due to the integration of rmap in 2.6 (already verified, just to be
sure my math was correct), and 2.4 mainline as well will run oom
(but with no lockup in 2.4 mainline, since it can handle zone normal
failures gracefully unlike 2.6) since it lacks pte-highmem.
If this slows down the VM on the low end (i.e. a 32M machine;
probably the smallest box where I tested it is my laptop with 256M ;),
try increasing the vm_cache_scan_ratio sysctl, and please notify me
of any regression. Thank you!
Only in 2.4.23aa1: 21_pte-highmem-mremap-smp-scale-1
Only in 2.4.23aa2: 21_pte-highmem-mremap-smp-scale-2
Fix race condition if src is cleared under us while
we allocate the pagetable (from 2.6 CVS, from Andrew).
Only in 2.4.23aa2: 30_20-nfs-directio-lfs-1
Read right page with directio.
Only in 2.4.23aa2: 9999900_overflow-buffers-1
paranoid check (dirty buffers should always be <= NORMAL_ZONE,
and the other counters for clean and locked are never read,
so this is a noop but it's safer).
Only in 2.4.23aa2: 9999900_scsi-deadlock-fix-1
Avoid locking inversion.
^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27  1:33 Andrea Arcangeli
@ 2004-02-27  4:38 ` Rik van Riel
  2004-02-27 17:32   ` Andrea Arcangeli
  0 siblings, 1 reply; 100+ messages in thread
From: Rik van Riel @ 2004-02-27 4:38 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel

On Fri, 27 Feb 2004, Andrea Arcangeli wrote:

> becomes freeable/swappable. Some lib function in the patch is taken
> from the objrmap patch for 2.6 in the mbligh tree implemented by IBM
> (thanks to Martin and IBM for maintaining that patch uptodate for 2.6,
> that is a must-have starting point for the 2.6 VM too). The original
> idea of using objrmap for the vm unmapping procedure is from David
> Miller (objrmap itself has always existed in every linux kernel out

Good to hear that you're finally convinced that some form of reverse
mapping is needed.

I agree with you that object based rmap may well be better for 2.6; if
you want to look into that I wouldn't mind at all. Especially if we
can keep akpm's and Nick's nice VM balancing intact ...

> The rest is infinite swapping and machine total hung, since those 4.8G
> of swapcache are now freeable and dirty, and the kernel will not
> notice the freeable and clean swapcache generated by this 64M
> swapout, since it's being queued at the opposite side of the lru

An obvious solution for this is the O(1) VM stuff that Arjan wrote and
I integrated into rmap 15. Should be worth looking into this for the
2.6 kernel ...

Basically it keeps the just-written pages near the end of the LRU, so
they're easily found and freed before the kernel even starts thinking
about submitting the other gigabytes of dirty data for writeout.

> efficient swapping because it was taking a long time before the
> vm could notice the clean swapcache, after it started the I/O on
> it.

... Arjan's O(1) VM stuff ;)

> It was pretty clear after that, that we've to prioritize and to
> prefer discarding memory that is zerocost to collect, than to do
> extremely expensive things to release free memory instead.

I'm not convinced. If we need to free up 10MB of memory, we just
shouldn't do much more than 10MB of IO. Doing just that should be
cheap enough, after all.

The problem is when you do two orders of magnitude more writes than
the amount of memory you need to free. Trying to do zero IO probably
isn't quite needed ...

> vm is an order of magnitude worse in the high end. So the fix I
> implemented is to run a inactive_list/vm_cache_scan_ratio pass
> on the clean immediatly freeable cache in the inactive list

Should work ok for a while, until you completely run out of clean
pages and then you might run into a wall ... unless you implement
smarter cleaning & freeing like Arjan's stuff does.

Then again, your stuff will also find pages the moment they're
cleaned, just at the cost of a (little?) bit more CPU time. Shouldn't
be too critical, unless you've got more than maybe a hundred GB of
memory, which should be a year off.

> A better fix would be to have an anchor in the lru (can be a per-lru
> page_t with a PG_anchor set) and to avoid the clean-cache search to
> alter the point where we keep swapping with writepage, but it
> shouldn't matter that much and 2.4 being obsolete isn't very
> worthwhile to make it even better.

Hey, that's Arjan's stuff ;) Want to help get that into 2.6? ;)

cheers,

Rik
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27  4:38 ` Rik van Riel
@ 2004-02-27 17:32 ` Andrea Arcangeli
  2004-02-27 19:08   ` Rik van Riel
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 17:32 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel

On Thu, Feb 26, 2004 at 11:38:20PM -0500, Rik van Riel wrote:
> On Fri, 27 Feb 2004, Andrea Arcangeli wrote:
>
> > becomes freeable/swappable. Some lib function in the patch is taken
> > from the objrmap patch for 2.6 in the mbligh tree implemented by IBM
> > (thanks to Martin and IBM for maintaining that patch uptodate for 2.6,
> > that is a must-have starting point for the 2.6 VM too). The original
> > idea of using objrmap for the vm unmapping procedure is from David
> > Miller (objrmap itself has always existed in every linux kernel out
>
> Good to hear that you're finally convinced that some form
> of reverse mapping is needed.

I've been convinced of that since April 2003 (and it wasn't a 1st of
April joke when I posted this ;)

http://groups.google.it/groups?q=g:thl1757664681d&dq=&hl=it&lr=&ie=UTF-8&selm=20030405025008%2463d0%40gated-at.bofh.it&rnum=19

quote:

"Indeed. objrmap is the only way to avoid the big rmap waste. In fact
I'm not even convinced about the hybrid approach, rmap should be
avoided even for the anon pages. And the swap cpu doesn't matter, as
far as we can reach pagetables in linear time that's fine, doesn't
matter how many fixed cycles it takes. Only the complexity factor
matters, and objrmap takes care of it just fine."

It's not like I woke up yesterday with the idea of changing it; it's
just that nobody listened to my argument for almost a year, so now
we're stuck with a 2.6 kernel that has no way to run in the high end
(i.e. >200 tasks with 2.7G of shm mapped each on a 32G box; with
2.4-aa I can reach ~6k tasks with 2.7G mapped each).
Then Andrew pointed out that there are complexity issues that objrmap
can't handle, but I'm not concerned about the complexity issue of
objrmap, since no real app will run into it and it's mostly a red
herring anyway, since you should be able to trigger the same
complexity issues with truncate already.

The thing I've been against for years, and that I'm still against, is
"rmap", not objrmap. "rmap" is what prevents 2.6 from being able to
handle high end workloads. Not even 4:4 can hide the rmap overhead:
even ignoring the slowdown introduced by 4:4, you still lock up at 4
times more address space mapped, and 4 times more address space than
what 2.6 can do now is still a tiny fraction of what my current 2.4-aa
can map with objrmap.

rmap has nothing to do with objrmap. objrmap is available in every
linux kernel out there (2.0 probably had objrmap too); this is why it's
zero cost for linux to use objrmap. We just start using it for the
paging mechanism for the first time, instead of building a new,
redundant, extremely zone-normal-costly infrastructure (i.e. rmap). I
also don't buy the 64bit argument: the waste is there too, it's just
not a showstopper blocker on 64bit, but the fact that it's a blocker
for the 32bit archs is a good thing, since we're forced to optimize
the 64bit archs too.

As for remap_file_pages I see two ways:

1) implicit mlock, and allow it only to root (i.e. the lock
   capability) or under your sysctl that enables mlock for everybody;
   this is the simple, lazy way, and I think this is what you're doing
   in 2.4 too

2) use a pagetable walk on every vma marked VM_NONLINEAR queued into
   the address space; we need that pagetable walk anyway to make
   truncate perfect

To avoid altering the API, probably the first remap_file_pages should
set VM_NONLINEAR on the vma (maybe it's already doing that, I didn't
check). So I think 2 is the best. Sure, one can argue it will waste
cpu, but this is just a correctness thing, it doesn't need to be fast
at all; I prefer to optimize for the fast path.
> I agree with you that object based rmap may well be better
> for 2.6, if you want to look into that I wouldn't mind at
> all. Especially if we can keep akpm's and Nick's nice VM
> balancing intact ...

Glad we both like it. So my current plan is to do objrmap for all file
mappings first (this is a blocker showstopper issue, or 2.6 will
simply lock up immediately in the high end; already verified, just to
be sure I wasn't missing something in the code), then convert
remap_file_pages to do the pagetable walk instead of relying on rmap,
and then I can go further and add a dummy inode for anonymous mappings
too during COW, like DaveM did originally. Only then can I remove rmap
entirely. This last step is somewhat lower prio.

> > The rest is infinite swapping and machine total hung, since those 4.8G
> > of swapcache are now freeable and dirty, and the kernel will not
> > notice the freeable and clean swapcache generated by this 64M
> > swapout, since it's being queued at the opposite side of the lru
>
> An obvious solution for this is the O(1) VM stuff that Arjan
> wrote and I integrated into rmap 15. Should be worth looking
> into this for the 2.6 kernel ...
>
> Basically it keeps the just-written pages near the end of the
> LRU, so they're easily found and freed, before the kernel even
> starts thinking about submitting the other gigabytes of dirty
> data for writeout.
>
> > efficient swapping because it was taking a long time before the
> > vm could notice the clean swapcache, after it started the I/O on
> > it.
>
> ... Arjan's O(1) VM stuff ;)
>
> > It was pretty clear after that, that we've to prioritize and to
> > prefer discarding memory that is zerocost to collect, than to do
> > extremely expensive things to release free memory instead.
>
> I'm not convinced. If we need to free up 10MB of memory, we just

If you're not convinced, I assume Arjan's O(1) VM stuff is not doing
that, which means Arjan's O(1) VM stuff has little to do with point 3.
Point 1 is objrmap; you have rmap instead. Point 2 is "start I/O from
shm_writepage", and you definitely want that too. O(1) VM can work
around the lack of "start I/O from shm_writepage", but it's the wrong
workaround for that: I/O must be started immediately from
shm_writepage, there is no point in delaying it. When writepage is
called, I/O must be started; waiting for another small pass of the o1
vm is wasted cpu. As I wrote, this was a trivial two liner that you
can easily merge, and it's orthogonal to all the other issues. I see
o1 vm may be hiding this stupidity of shm_writepage for you, but you
will get a benefit from the proper fix.

The only similarity between my stuff and Arjan's O(1) VM stuff is in
point 3, but only for the "searching of the clean swapcache". Since
you're not convinced about the above, it means you don't get that
"clean cache is a memleak" without -aa's latest stuff. My stuff takes
care of both issues at the same time. Clearly Arjan's O(1) VM stuff
could be stacked on top of my stuff if I wanted, to reach the
swapped-out, now-clean swapcache more quickly. This is something I
don't want to do, because I think that such O(1) VM is not O(1) at
all; in fact it may end up wasting a lot more cpu depending on the
allocation frequency and the disk speed, since you've no guarantee
that when you run into the page again you will find it unlocked - it
may still be under I/O, so it may actually waste cpu instead of saving
it.

Overall, Arjan's O(1) VM stuff can't help in the workload I was
dealing with, since that is all about freeing clean cache first and
it's not doing that. Freeing clean swapcache is important too, but
it's not as important as avoiding I/O when we can.

> shouldn't do much more than 10MB of IO. Doing just that should be
> cheap enough, after all.
>
> The problem is when you do two orders of magnitude more writes than
> the amount of memory you need to free. Trying to do zero IO probably
> isn't quite needed ...
In small machines the current 2.4 stock algorithm works just fine too;
it's only when the lru has a million pages queued that, without my new
vm algorithm, you'll do a million swapouts before freeing the
memleak^Wcache. All I care about is avoiding the I/O if I can, and
that's the only thing my patch is doing. This is about life or death of
the machine, it's not a fast/slow issue ;).

> > vm is an order of magnitude worse in the high end. So the fix I
> > implemented is to run a inactive_list/vm_cache_scan_ratio pass
> > on the clean immediately freeable cache in the inactive list
>
> Should work ok for a while, until you completely run out of
> clean pages and then you might run into a wall ... unless you
> implement smarter cleaning & freeing like Arjan's stuff does.

I understand the smart cleaning & freeing; as said above, it's not
obvious, because of the cpu waste it risks generating, and with the "put
it near the end of the lru" approach one has to define "near", which is
a mess. I'm not saying the O(1) VM will necessarily waste cpu, I'm just
saying it's non-obvious, and it's in no way similar or equivalent to my
code.

> Then again, your stuff will also find pages the moment they're
> cleaned, just at the cost of a (little?) bit more CPU time.

Exactly, that's an important effect of my patch, and it's the only thing
that the O(1) VM is taking care of. I don't think it's enough, since the
gigs of cache would still be like a memleak without my code.

> Shouldn't be too critical, unless you've got more than maybe
> a hundred GB of memory, which should be a year off.

I think these effects start to be visible over 8G; the worst thing is
that you can have 4G in a row of swapcache. In smaller systems the lru
tends to be more intermixed.
> > A better fix would be to have an anchor in the lru (can be a per-lru
> > page_t with a PG_anchor set) and to avoid the clean-cache search to
> > alter the point where we keep swapping with writepage, but it
> > shouldn't matter that much and 2.4 being obsolete isn't very
> > worthwhile to make it even better.
>
> Hey, that's Arjan's stuff ;)  Want to help get that into 2.6 ? ;)

I think you mean he's using an anchor in the lru too, in the same way I
proposed here, but I doubt he's using it nearly the way I would. There
seems to be a fundamental difference between the two algorithms, with
mine partly covering the work done by his, and not the other way around.

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 17:32 ` Andrea Arcangeli
@ 2004-02-27 19:08 ` Rik van Riel
  2004-02-27 20:29   ` Andrew Morton
    ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Rik van Riel @ 2004-02-27 19:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, Andrew Morton

First, let me start with one simple request. Whatever you do, please
send changes upstream in small, manageable chunks, so we can merge your
improvements without destabilising the kernel.

We should avoid the kind of disaster we had around 2.4.10...

On Fri, 27 Feb 2004, Andrea Arcangeli wrote:

> Then Andrew pointed out that there are complexity issues that objrmap
> can't handle but I'm not concerned about the complexity issue of objrmap
> since no real app will run into it

We've heard the "no real app runs into it" argument before, about
various other subjects. I remember using it myself, too, and every
single time I used the "no real apps run into it" argument I turned out
to be wrong in the end.

> So my current plan is to do objrmap for all file mappings first

If this can be integrated cleanly without too many bad corner cases,
sure ...

> then convert remap_file_pages to do the pagetable walk instead of
> relying on rmap,

I'm not convinced, though that could also be because I'm not sure
exactly what you're planning. I'll start arguing for or against your
changes here once I know exactly what they'll look like ;)

> then I can go further and add a dummy inode for anonymous mappings too
> during COW like DaveM did originally. Only then can I remove rmap
> entirely. This last step is somewhat lower prio.

Moving to a full objrmap from the current pte-rmap could well be a good
thing from a code cleanliness perspective.
I'm not particularly attached to rmap.c and won't be opposed to a
replacement, provided that the replacement is also more or less modular
with the VM, so plugging in an even more improved version in the future
will be easy ;)

> in small machines the current 2.4 stock algo works just fine too, it's
> only when the lru has the million pages queued that without my new vm
> algo you'll do million swapouts before freeing the memleak^Wcache.

Same for Arjan's O(1) VM. For machines in the single and low double
digit number of gigabytes of memory, either would work similarly well ...

> > Then again, your stuff will also find pages the moment they're
> > cleaned, just at the cost of a (little?) bit more CPU time.
>
> exactly, that's an important effect of my patch and that's the only
> thing that o1 vm is taking care of, I don't think it's enough since the
> gigs of cache would still be like a memleak without my code.

... however, if you have a hundred gigabytes of memory, or even more,
then you cannot afford to search the inactive list for clean pages on
swapout. It will end up using too much CPU time.

The FreeBSD people found this out the hard way, even on smaller
systems...

> > Shouldn't be too critical, unless you've got more than maybe
> > a hundred GB of memory, which should be a year off.
>
> I think these effects start to be visible over 8G, the worst thing is
> that you can have 4G in a row of swapcache, in smaller systems the
> lru tends to be more intermixed.

I've even seen the problem on small systems, where I used a "smart"
algorithm that freed the clean pages first and only cleaned the dirty
pages later. On my 128 MB desktop system everything was smooth, until
the point where the cache was gone and the system suddenly faced an
inactive list entirely filled with dirty pages.

Because of this, we should do some (limited) pre-cleaning of inactive
pages.
The key word here is "limited" ;)

> I think you mean he's using an anchor in the lru too in the same way I
> proposed here, but I doubt he's using it nearly as I would, there seems
> to be a fundamental difference in the two algorithms, with mine partly
> covering the work done by his, and not the other way around.

An anchor in the lru list is definitely needed. Some companies want to
run Linux on systems with 256 GB or more memory. In those systems the
amount of CPU time used to search the inactive list will become a
problem, unless we use a smartly placed anchor.

Note that I wouldn't want to use the current O(1) VM code on such a
system, because the placement of the anchor isn't quite smart enough ...

Let's try combining your ideas and Arjan's ideas into something that
fixes all these problems.

kind regards,

Rik
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 19:08 ` Rik van Riel
@ 2004-02-27 20:29 ` Andrew Morton
  2004-02-27 20:49   ` Rik van Riel
    ` (3 more replies)
  2004-02-27 20:31 ` Andrea Arcangeli
  2004-02-29  6:34 ` Mike Fedyk
  2 siblings, 4 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 20:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: andrea, linux-kernel

Rik van Riel <riel@redhat.com> wrote:
>
> First, let me start with one simple request.  Whatever you do,
> please send changes upstream in small, manageable chunks so we
> can merge your improvements without destabilising the kernel.
>
> We should avoid the kind of disaster we had around 2.4.10...

We need to understand that right now, 2.6.x is 2.7-pre. Once 2.7 forks
off we are more at liberty to merge nasty highmem hacks which will die
when 2.6 is end-of-lined.

I plan to merge the 4g split immediately after 2.7 forks. I wouldn't be
averse to objrmap for file-backed mappings either - I agree that the
search problems which were demonstrated are unlikely to bite in real
life. But first someone would need to demonstrate that pte_chains+4g/4g
are for some reason unacceptable for some real-world setup.

Apart from the search problem, my main gripe with objrmap is that it
creates different handling for file-backed and anonymous memory. And
the code which extends it to anonymous memory is complex and large. One
ends up needing to seriously ask oneself what is being gained from it
all.

> We've heard the "no real app runs into it" argument before,
> about various other subjects.  I remember using it myself,
> too, and every single time I used the "no real apps run into
> it" argument I turned out to be wrong in the end.

heh.
> I'm not particularly attached to rmap.c and won't be opposed
> to a replacement, provided that the replacement is also more
> or less modular with the VM so plugging in an even more
> improved version in the future will be easy ;)

Sure, let's see what it looks like. Even if it is nearly two years late.

Oh, and can we please have testcases? It's all very well to assert "it
sucks doing X and I fixed it", but it's a lot more useful if one can
distribute testcases as well, so others can evaluate the fix and can
explore alternative solutions.

Andrea, this shmem problem is a case in point, please.

> > in small machines the current 2.4 stock algo works just fine too, it's
> > only when the lru has the million pages queued that without my new vm
> > algo you'll do million swapouts before freeing the memleak^Wcache.
>
> Same for Arjan's O(1) VM.  For machines in the single and low
> double digit number of gigabytes of memory either would work
> similarly well ...

Case in point. We went round the O(1) page reclaim loop a year ago and
I was never able to obtain a testcase which demonstrated the problem on
2.4, let alone on 2.6.

I had previously found some workloads in which the 2.4 VM collapsed for
similar reasons and those were fixed with the rotate_reclaimable_page()
logic. Without testcases we will not be able to verify that anything
else needs doing.

> ... however, if you have a hundred gigabytes of memory, or
> even more, then you cannot afford to search the inactive
> list for clean pages on swapout.  It will end up using too
> much CPU time.
>
> The FreeBSD people found this out the hard way, even on
> smaller systems...

Did they have a testcase?

> On my 128 MB desktop system everything was smooth, until
> the point where the cache was gone and the system suddenly
> faced an inactive list entirely filled with dirty pages.
>
> Because of this, we should do some (limited) pre-cleaning
> of inactive pages.
> The key word here is "limited" ;)

Current 2.6 will write out nr_inactive>>DEF_PRIORITY pages, will then
throttle behind I/O, and will then start reclaiming clean pages from the
tail of the LRU which were moved there at interrupt time.

> An anchor in the lru list is definitely needed.

Maybe not. Testcase, please ;)

> Let's try combining your ideas and Arjan's ideas into
> something that fixes all these problems.

Did I mention testcases?

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:29 ` Andrew Morton
@ 2004-02-27 20:49 ` Rik van Riel
  2004-02-27 20:55   ` Andrew Morton
    ` (2 more replies)
  2004-02-27 21:15 ` Andrea Arcangeli
    ` (2 subsequent siblings)
  3 siblings, 3 replies; 100+ messages in thread
From: Rik van Riel @ 2004-02-27 20:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Fri, 27 Feb 2004, Andrew Morton wrote:

> But first someone would need to demonstrate that pte_chains+4g/4g are
> for some reason unacceptable for some real-world setup.

Agreed. The current 2.6 VM is well tuned already, so we should be
extremely cautious not to upset it.

> > Same for Arjan's O(1) VM.  For machines in the single and low
> > double digit number of gigabytes of memory either would work
> > similarly well ...
>
> I had previously found some workloads in which the 2.4 VM collapsed for
> similar reasons and those were fixed with the rotate_reclaimable_page()
> logic.  Without testcases we will not be able to verify that anything else
> needs doing.

Duh, I forgot all about the rotate_reclaimable_page() stuff. That may
well fix all the problems 2.6 would otherwise have had in this area.

I really hope we won't need anything like the O(1) VM stuff in 2.6,
since that would leave me more time to work on other cool stuff (like
resource management ;)).

> > Because of this, we should do some (limited) pre-cleaning
> > of inactive pages.  The key word here is "limited" ;)
>
> Current 2.6 will write out nr_inactive>>DEF_PRIORITY pages,

That may be a bit much on extremely huge systems, but that should
require no more than a little tweaking to fix. Certainly no code
changes should be needed ...

> will then throttle behind I/O and then will start reclaiming clean pages
> from the tail of the LRU which were moved there at interrupt time.

That may well be much better than either the O(1) VM stuff or the stuff
Andrea proposed...
Forget about me proposing the O(1) VM stuff ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:49 ` Rik van Riel
@ 2004-02-27 20:55 ` Andrew Morton
  2004-02-27 21:28 ` Andrea Arcangeli
  2004-03-01 11:10 ` Nikita Danilov
  2 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 20:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: andrea, linux-kernel

Rik van Riel <riel@redhat.com> wrote:
>
> > Current 2.6 will write out nr_inactive>>DEF_PRIORITY pages,
>
> That may be a bit much on extremely huge systems, but that should
> require no more than a little tweaking to fix.  Certainly no code
> changes should be needed ...

hmm, with 4 million pages on the inactive list that's 1000 pages. It
might be OK.

Bear in mind that under usual circumstances the direct-reclaim path
will refuse to block on request queue exhaustion, so we might end up
just scanning past some dirty pages without starting I/O against them
at all. End result: some jumbling up of the LRU order. I suspect that's
a second-order problem though.

But hey, if we have a testcase, we can fix it!

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:49 ` Rik van Riel
  2004-02-27 20:55 ` Andrew Morton
@ 2004-02-27 21:28 ` Andrea Arcangeli
  2004-02-27 21:37   ` Andrea Arcangeli
  2004-02-28  3:22   ` Andrea Arcangeli
  1 sibling, 2 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 21:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 27, 2004 at 03:49:03PM -0500, Rik van Riel wrote:
> On Fri, 27 Feb 2004, Andrew Morton wrote:
>
> > But first someone would need to demonstrate that pte_chains+4g/4g are
> > for some reason unacceptable for some real-world setup.
>
> Agreed.  The current 2.6 VM is well tuned already so
> we should be extremely cautious not to upset it.

This is very easy:

	2.7*1024*1024*1024/4096*8*700/1024/1024 = 3780M

At ~700 tasks, 4:4 will lock up with rmap. With 3:1 and rmap the limit
is down to around ~150 users. That's nothing, way too low; a normal
regression test on 12G uses >1k tasks.

Regardless, the 4:4 buzzword sounds like a terribly wrong idea to me for
99% of its proposed 64G usages (and with rmap, 4:4 is needed even for an
8G box, not just for the 64G box, which sounds unacceptable). My 2:2
proposal sounds like it has a lot more potential technically than 4:4. I
know for sure that if it was me owning that 64G hardware and running
that big software, I would first of all try 2:2 with 1.7G per process
and then compare with 4:4.

I would like to get number comparisons for my 2:2 buzzword too, but I
have failed so far (the first time I asked was August 2003, and that was
for 2.4, where 2:2 is easy too). Some resources will have to be
allocated soon to test my 2:2 idea; if it turns out to do as well as I
expect compared to 4:4, I won't personally need to deal with the 4:4
2.0 design (yes, 2.0 design), so I try to be optimistic about this too
;). If I'm wrong, then we may be forced to allow a special 4:4 option to
use on the 64G boxes.
I wanted to do page clustering, but there are too many other things to
do first, so it may be too late for 2.6 for the page clustering (for
mainline it's pretty much obviously too late; I was thinking of 2.6-aa
here).

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 21:28 ` Andrea Arcangeli
@ 2004-02-27 21:37 ` Andrea Arcangeli
  2004-02-28  3:22 ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 21:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 27, 2004 at 10:28:44PM +0100, Andrea Arcangeli wrote:
> expect if compared to 4:4, I won't personally need to deal with the 4:4

<joke>
and btw, if I will have the luck of not having to deal with the 4:4 2.0
kernel slowdown, it's also because AMD is effectively saving the soul of
the vm hackers ;)
</joke>

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 21:28 ` Andrea Arcangeli
  2004-02-27 21:37 ` Andrea Arcangeli
@ 2004-02-28  3:22 ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-28  3:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 27, 2004 at 10:28:44PM +0100, Andrea Arcangeli wrote:
> on the 64G boxes. I wanted to do page clustering but there are too many

For the record, by "page clustering" above I meant the patch originally
developed by Hugh for 2.4.7 and then further developed and currently
maintained by William on kernel.org.

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:49 ` Rik van Riel
  2004-02-27 20:55 ` Andrew Morton
  2004-02-27 21:28 ` Andrea Arcangeli
@ 2004-03-01 11:10 ` Nikita Danilov
  2 siblings, 0 replies; 100+ messages in thread
From: Nikita Danilov @ 2004-03-01 11:10 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, andrea, linux-kernel

Rik van Riel writes:

[...]

> Duh, I forgot all about the rotate_reclaimable_page() stuff.
> That may well fix all problems 2.6 would have otherwise had
> in this area.
>
> I really hope we won't need anything like the O(1) VM stuff
> in 2.6, since that would leave me more time to work on other
> cool stuff (like resource management ;)).

Page-out from the end of the inactive list is not efficient, because
pages are submitted for IO in more or less random order and this
results in a lot of seeks.

Test-case: replace ->writepage() with

	int foofs_writepage(struct page *page)
	{
		SetPageDirty(page);
		unlock_page(page);
		return 0;
	}

and run

	$ time cp /tmpfs/huge-data-set /foofs

File systems (and anonymous memory) want clustered write-out, and VM
designs with a separate write-out queue (like the O(1) VM) are better
suited for this.

Nikita.

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 20:29 ` Andrew Morton
  2004-02-27 20:49 ` Rik van Riel
@ 2004-02-27 21:15 ` Andrea Arcangeli
  2004-02-27 22:03   ` Martin J. Bligh
  2004-02-27 21:42 ` Hugh Dickins
  2004-02-27 23:18 ` Marcelo Tosatti
  3 siblings, 1 reply; 100+ messages in thread
From: Andrea Arcangeli @ 2004-02-27 21:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, linux-kernel

On Fri, Feb 27, 2004 at 12:29:36PM -0800, Andrew Morton wrote:
> Rik van Riel <riel@redhat.com> wrote:
> >
> > First, let me start with one simple request.  Whatever you do,
> > please send changes upstream in small, manageable chunks so we
> > can merge your improvements without destabilising the kernel.
> >
> > We should avoid the kind of disaster we had around 2.4.10...
>
> We need to understand that right now, 2.6.x is 2.7-pre.  Once 2.7 forks off
> we are more at liberty to merge nasty highmem hacks which will die when 2.6
> is end-of-lined.
>
> I plan to merge the 4g split immediately after 2.7 forks.  I wouldn't be

Note that the 4:4 split is wrong in 99% of the cases where people need
64 gigs. I'm advocating strongly for the 2:2 split to everybody I talk
with; I'm trying to spread the 2:2 idea because IMHO it's an order of
magnitude simpler and an order of magnitude superior. Unfortunately I
could not get a single number to back my 2:2 claims, since the 4:4
buzzword is spreading and people only test with 4:4, so it's pretty
hard for me to spread the 2:2 buzzword.

4:4 makes no sense at all. The only advantage of 4:4 w.r.t. 2:2 is that
they can map 2.7G of shm per task instead of 1.7G of shm per task. Oh,
they also have 1G more of normal zone, but that's useless, since 32G
with 3:1 works perfectly and more zone-normal won't help at all (and if
you leave rmap in, the kernel will lock up no matter whether it's 4:4
or 3:1 or 2:2 or 1:3). But the one more gig they can map per task will
give them nothing, since they flush the tlb on every syscall and every
irq.
So it's utterly stupid to map one more gig per task with the end result
that you have to switch_mm on every syscall and irq. I expect the
databases will run an order of magnitude faster with _2:2_ in a 64G
configuration, with _1.7G_ of shm mapped per process, instead of their
4:4 split with 2.7G (or more, up to 3.9 ;) mapped per task.

I don't mind if 4:4 gets merged, but I recommend db vendors to
benchmark _2:2_ against 4:4 before remotely considering deploying 4:4
in production. Then of course let me know, since I had not the luck of
getting any number back and I've no access to any 64G box.

I don't care about 256G with the 2:2 split, since intel and hp are now
going x86-64 too. Going past 32G, the bigpages make a huge difference,
not just for the pte memory overhead but for the tlb caching; this is
what makes me very comfortable claiming 2:2 will pay off big compared
to 4:4.

> averse to objrmap for file-backed mappings either - I agree that the search
> problems which were demonstrated are unlikely to bite in real life.

Cool. Martin's patch from IBM is a great start IMHO. I found a bug in
the vma flags check though: VM_RESERVED should be checked too, not only
VM_LOCKED, unless I'm missing something, but it's a minor issue. The
other scary part is if the trylocking fails too often; it would be nice
to be able to spin and not to trylock, I would feel safer. In 2.4 I
don't care, since there it's best-effort: I don't depend on the
trylocking to succeed to unmap pages, the original walk still runs and
it spins.

> But first someone would need to demonstrate that pte_chains+4g/4g are
> for some reason unacceptable for some real-world setup.

With an rmap kernel the limit goes from 150 users with 3:1 to around
700 users with 4:4. In 2.4 I can handle ~6k users at full speed with
3:1. And the 4:4 slowdown is such a big order of magnitude that I
believe it's crazy to use 4:4 even on a 64G box, where I advocate 2:2
instead.
And if you leave rmap in, 4:4 will be needed even on an 8G box (not
only on 64G boxes) to get past 700 users.

> Apart from the search problem, my main gripe with objrmap is that it
> creates different handling for file-backed and anonymous memory.  And the
> code which extends it to anonymous memory is complex and large.  One ends
> up needing to seriously ask oneself what is being gained from it all.

I don't have a definitive answer, but trying to use objrmap for anon
too is my objective. It's not clear whether it's worth it or not,
though. But this is lower priority.

> > We've heard the "no real app runs into it" argument before,
> > about various other subjects.  I remember using it myself,
> > too, and every single time I used the "no real apps run into
> > it" argument I turned out to be wrong in the end.
>
> heh.

My answer to this is that truncate() may already be running into weird
apps. Sure, the vm has more probability, since truncate of mapped files
isn't too frequent, but if you really expect bad luck, we already have
a window open for the bad luck ;). I try to be an optimist ;). Let's
say I know at least that the most important apps won't run into this.
Currently the most important apps will lock up, so I don't have much
choice.

> Oh, and can we please have testcases?  It's all very well to assert "it
> sucks doing X and I fixed it" but it's a lot more useful if one can
> distribute testcases as well so others can evaluate the fix and can
> explore alternative solutions.
>
> Andrea, this shmem problem is a case in point, please.

I don't have the real-life testcase myself (I lack both the software
and the hardware to reproduce it, and it's not easy to run the thing
either), but I think it's possible to test it as we move to 2.6. At the
moment it's pointless to try due to rmap, but as soon as it's in good
shape and the math gives the ok, I will try to get stuff tested in
practice (at the moment I only verified that rmap is a showstopper, as
the math says, but just to be sure).
We can write a testcase ourselves, it's pretty easy: just create a 2.7G
file in /dev/shm, mmap(MAP_SHARED) it from 1k processes, and fault in
all the pagetables from all tasks touching the shm vma. Then run a
second copy until the machine starts swapping and see how things go. To
do this you need probably 8G; this is why I didn't write the testcase
myself yet ;). Maybe I can simulate with less shm and fewer tasks on 1G
boxes too, but the extreme lru effects of point 3 won't be visible
there; the very same software configuration works fine on 1/2G boxes on
stock 2.4. Problems show up when the lru grows, due to the algorithm
not contemplating millions of dirty swapcache pages in a row at the end
of the lru and some gigs of free cache at the head of the lru. The
rmap-only issues can also be tested with math; no testcase is needed
for that.

> Maybe not.  Testcase, please ;)

I think a more efficient algorithm to achieve the O(1) VM objective
(which is only one of the issues that my point 3 solves) could be to
have a separate lru (not an anchor) protected by an irq spinlock
(spinlock because it will be accessed by the I/O completion routine)
that we check once per second and not more often, so the variable
becomes "time" instead of "frequency of allocations", since swapout
disk I/O is going to be quite dependent on fixed time, rather than on
the allocation rate, which isn't really fixed. This way we know we
won't throw too huge an amount of cpu at locked pages, and it avoids
the anchor and in turn the non-obvious anchor placement problem.
However the coding of this would be complex, maybe not more complex
than the O(1) VM code though. The usage I wanted to make of an anchor
in 2.4 is completely different from the usage the O(1) VM is making of
it, IMHO.

thanks for the help!

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 21:15 ` Andrea Arcangeli
@ 2004-02-27 22:03 ` Martin J. Bligh
  2004-02-27 22:23   ` Andrew Morton
  2004-02-28  2:32   ` Andrea Arcangeli
  0 siblings, 2 replies; 100+ messages in thread
From: Martin J. Bligh @ 2004-02-27 22:03 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton; +Cc: Rik van Riel, linux-kernel

> note that the 4:4 split is wrong in 99% of cases where people need 64
> gigs. I'm advocating strongly for the 2:2 split to everybody I talk
> with, I'm trying to spread the 2:2 idea because IMHO it's an order of
> magnitude simpler and an order of magnitude superior. Unfortunately I
> could not get a single number to back my 2:2 claims, since the 4:4
> buzzword is spreading and people only test with 4:4. so it's pretty hard
> for me to spread the 2:2 buzzword.

For the record, I for one am not opposed to doing 2:2 instead of 4:4.
What pisses me off is people trying to squeeze large amounts of memory
into 3:1, and distros pretending it's supportable, when it's never
stable across a broad spectrum of workloads. Between 2:2 and 4:4, it's
just a different overhead tradeoff.

> 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is that
> they can map 2.7G per task of shm instead of 1.7G per task of shm.

Eh? You have a 2GB difference of user address space, and a 1GB
difference of shm size. You lost a GB somewhere ;-) Depending on
whether you move TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7
or at a pinch 3.5 vs 1.5, I'm not sure.

> syscall and irq. I expect the databases will run an order of magnitude
> faster with _2:2_ in a 64G configuration, with _1.7G_ per process of shm
> mapped, instead of their 4:4 split with 2.7G (or more, up to 3.9 ;)
> mapped per task.

That may well be true for some workloads; I suspect it's slower for
others. One could call the tradeoff either way.
> I don't mind if 4:4 gets merged but I recommend db vendors to benchmark
> _2:2_ against 4:4 before remotely considering deploying 4:4 in
> production. Then of course let me know since I had not the luck to get
> any number back and I've no access to any 64G box.

If you send me a *simple* simulation test, I'll gladly run it for you
;-) But I'm not going to go fiddle with Oracle, and thousands of disks
;-)

> I don't care about 256G with 2:2 split, since intel and hp are now going
> x86-64 too.

Yeah, I don't think we ever need to deal with that kind of insanity ;-)

> > averse to objrmap for file-backed mappings either - I agree that the search
> > problems which were demonstrated are unlikely to bite in real life.
>
> cool.
>
> Martin's patch from IBM is a great start IMHO. I found a bug in the vma
> flags check though, VM_RESERVED should be checked too, not only
> VM_LOCKED, unless I'm missing something, but it's a minor issue.

I didn't actually write it - that was Dave McCracken ;-) I just
suggested the partial approach (because I'm dirty and lazy ;-)) and
carried it in my tree. I agree with Andrew's comments though - it's not
nice having the dual approach of the partial, but the complexity of the
full approach is a bit scary and buys you little in real terms
(performance and space).

I still believe that creating an "address_space like structure" for
anon memory, shared across VMAs, is an idea that might give us cleaner
code - it also fixes other problems like Andi's NUMA API binding.

> We can write a testcase ourself, it's pretty easy, just create a 2.7G
> file in /dev/shm, and mmap(MAP_SHARED) it from 1k processes and fault in
> all the pagetables from all tasks touching the shm vma. Then run a
> second copy until the machine starts swapping and see how thing goes. To
> do this you need probably 8G, this is why I didn't write the testcase
> myself yet ;).
> maybe I can simulate with less shm and less tasks on 1G
> boxes too, but the extreme lru effects of point 3 won't be visible
> there, the very same software configuration works fine on 1/2G boxes on
> stock 2.4. problems show up when the lru grows due to the algorithm not
> contemplating millions of dirty swapcache pages in a row at the end of
> the lru and some gigs of free cache at the head of the lru. the
> rmap-only issues can also be tested with math, no testcase is needed
> for that.

I don't have time to go write it at the moment, but I can certainly run
it on large-end hardware if that helps.

M.

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
  2004-02-27 22:03 ` Martin J. Bligh
@ 2004-02-27 22:23 ` Andrew Morton
  2004-02-28  2:32 ` Andrea Arcangeli
  1 sibling, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2004-02-27 22:23 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: andrea, riel, linux-kernel

"Martin J. Bligh" <mbligh@aracnet.com> wrote:
>
> > We can write a testcase ourself, it's pretty easy, just create a 2.7G
> > file in /dev/shm, and mmap(MAP_SHARED) it from 1k processes and fault in
> > all the pagetables from all tasks touching the shm vma. Then run a
> > second copy until the machine starts swapping and see how thing goes. To
> > do this you need probably 8G, this is why I didn't write the testcase
> > myself yet ;). maybe I can simulate with less shm and less tasks on 1G
> > boxes too, but the extreme lru effects of point 3 won't be visible
> > there, the very same software configuration works fine on 1/2G boxes on
> > stock 2.4. problems show up when the lru grows due to the algorithm not
> > contemplating millions of dirty swapcache in a row at the end of the lru
> > and some gigs of free cache at the head of the lru. the rmap-only issues
> > can also be tested with math, no testcase is needed for that.
>
> I don't have time at the moment to go write it, but I can certainly run
> it on large end hardware if that helps.

I think just

	usemem -m 2700 -f test-file -r 10 -n 1000

will do it. I need to verify that.

http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-27 22:03 ` Martin J. Bligh 2004-02-27 22:23 ` Andrew Morton @ 2004-02-28 2:32 ` Andrea Arcangeli 2004-02-28 4:57 ` Wim Coekaerts 2004-02-28 6:10 ` Martin J. Bligh 1 sibling, 2 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-28 2:32 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, linux-kernel > > 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is that > > they can map 2.7G per task of shm instead of 1.7G per task of shm. On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote: > > Eh? You have a 2GB difference of user address space, and a 1GB difference > of shm size. You lost a GB somewhere ;-) Depending on whether you move > TASK_UNMAPPPED_BASE or not, it you might mean 2.7 vs 0.7 or at a pinch > 3.5 vs 1.5, I'm not sure. the numbers I wrote are right. No shm size is lost. The shm size is >20G, it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G of address space of 3:1 like it doesn't fit in 2:2. I think nobody has tested 2:2 seriously on 64G boxes yet, I'm simply asking for that. And I agree with you that using 64G with 3:1 is not feasible for applications like databases; it's feasible for other apps, for example ones needing big caches (if you can manage to boot the machine ;) it's not a matter of opinion, it's a matter of fact, for a generic misc load the high limit of 3:1 is mem=48G, which is not too bad. What changes between 3:1 and 2:2 is the "view" on the 20G shm file, not the size of the shm. you can do fewer simultaneous mmaps with a 1.7G view instead of a 2.7G view. the nonlinear vma will be 1.7G in size with 2:2, instead of 2.7G in size with 3:1 or 4:4 (300M are as usual left for some hole, the binary itself and the stack) > > syscall and irq.
I expect the databases will run an order of magnitude > faster with _2:2_ in a 64G configuration, with _1.7G_ per process of shm > mapped, instead of their 4:4 split with 2.7G (or more, up to 3.9 ;) > mapped per task. > > That may well be true for some workloads, I suspect it's slower for others. > One could call the tradeoff either way. the only chance it's faster is if you never use syscalls and you drive all interrupts to other cpus and you have an advantage by mapping >2G in the same address space. If you use syscalls and irqs, then you'll keep flushing the address space, so you can as well use mmap and flush _by_hand_ only the interesting bits when you really run into a view-miss, so you can run at full speed in the fast path including syscalls and irqs. Most of the time the view will be enough, there's some aging technique to apply on the collection of the old buckets too. So I've some doubts 4:4 runs faster anywhere. I could be wrong though. > > I don't mind if 4:4 gets merged but I recommend db vendors to benchmark > _2:2_ against 4:4 before remotely considering deploying 4:4 in > production. Then of course let me know since I had not the luck to get > any number back and I've no access to any 64G box. > > If you send me a *simple* simulation test, I'll gladly run it for you ;-) > But I'm not going to go fiddle with Oracle, and thousands of disks ;-) :) thanks for the offer! ;) I would prefer a real-life db bench since syscalls and irqs are an important part of the load that hurts 4:4 most, it doesn't necessarily need to be oracle though. And if it's a cpu with a big tlb cache like p4 it would be preferable. maybe we should talk about this offline.
> > Yeah, I don't think we ever need to deal with that kind of insanity ;-) ;) > >> averse to objrmap for file-backed mappings either - I agree that the search > >> problems which were demonstrated are unlikely to bite in real life. > > > > cool. > > > > Martin's patch from IBM is a great start IMHO. I found a bug in the vma > > flags check though, VM_RESERVED should be checked too, not only > > VM_LOCKED, unless I'm missing something, but it's a minor issue. > > I didn't actually write it - that was Dave McCracken ;-) I just suggested > the partial approach (because I'm dirty and lazy ;-)) and carried it > in my tree. I know you didn't write it but I forgot who was the author so I just gave credit to IBM at large ;). thanks for giving the due credit to Dave ;) > I agree with Andrew's comments though - it's not nice having the dual > approach of the partial, but the complexity of the full approach is a > bit scary and buys you little in real terms (performance and space). > I still believe that creating an "address_space like structure" for > anon memory, shared across VMAs is an idea that might give us cleaner > code - it also fixes other problems like Andi's NUMA API binding. agreed. It's just lower prio at the moment since anon memory doesn't tend to be that much shared, so the overhead is minimal. > I don't have time to go write it at the moment, but I > can certainly run it on large end hardware if that helps. thanks, we should write it someday. that testcase isn't the one suitable for the 4:4 vs 2:2 thing though, for that a real-life thing is needed since irqs, syscalls (and possibly page faults but not that many with a db) are fundamental parts of the load. we could write a smarter testcase as well, but I guess using a db is simpler, evaluating 2:2 vs 4:4 is more a do-once thing, results won't change over time. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 2:32 ` Andrea Arcangeli @ 2004-02-28 4:57 ` Wim Coekaerts 2004-02-28 6:18 ` Andrea Arcangeli 2004-02-28 6:10 ` Martin J. Bligh 1 sibling, 1 reply; 100+ messages in thread From: Wim Coekaerts @ 2004-02-28 4:57 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Andrew Morton, Rik van Riel, linux-kernel On Sat, Feb 28, 2004 at 03:32:36AM +0100, Andrea Arcangeli wrote: > On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote: > > > > Eh? You have a 2GB difference of user address space, and a 1GB difference > > of shm size. You lost a GB somewhere ;-) Depending on whether you move > > TASK_UNMAPPPED_BASE or not, it you might mean 2.7 vs 0.7 or at a pinch > > 3.5 vs 1.5, I'm not sure. > > the numbers I wrote are right. No shm size is lost. The shm size is >20G, > it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G > of address space of 3:1 like it doesn't fit in 2:2. Andrea, one thing I don't think we have discussed before is that aside from mapping into shmfs or hugetlbfs, there is also the regular shmem segment (shmget) we always use. the way we currently allocate memory is like this: just a big shmem segment w/ shmget() up to like 1.7 or 2.5 gb, containing the entire in-memory part, or shm (reasonable-sized segment, between 400mb and today on 32bit up to like 1.7 - 2 gb) which is used for non buffercache (sqlcache, parse trees etc) plus a default of about 16000 mmaps into the shmfs file (or remap_file_pages), the total size ranging from a few gb to many gb, which contains the data buffer cache. we cannot put the sqlcache (shared pool) into shmfs and do the windowing and this is a big deal for performance as well. eg the larger the better. it would have to be able to get to a reasonable size, and you have about 512mb on top of that for the window into shmfs. average sizes range between 1gb and 1.7gb so a 2/2 split would not be useful here.
sql/plsql/java cache is quite important for certain things. I think Van is running a test on a 32gb box to compare the two but I think that would be too limiting in general to have only 2gb. wim ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 4:57 ` Wim Coekaerts @ 2004-02-28 6:18 ` Andrea Arcangeli 2004-02-28 6:45 ` Martin J. Bligh [not found] ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel> 0 siblings, 2 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-28 6:18 UTC (permalink / raw) To: Wim Coekaerts; +Cc: Martin J. Bligh, Andrew Morton, Rik van Riel, linux-kernel On Fri, Feb 27, 2004 at 08:57:14PM -0800, Wim Coekaerts wrote: > On Sat, Feb 28, 2004 at 03:32:36AM +0100, Andrea Arcangeli wrote: > > On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote: > > > > > > Eh? You have a 2GB difference of user address space, and a 1GB difference > > > of shm size. You lost a GB somewhere ;-) Depending on whether you move > > > TASK_UNMAPPPED_BASE or not, it you might mean 2.7 vs 0.7 or at a pinch > > > 3.5 vs 1.5, I'm not sure. > > > > the numbers I wrote are right. No shm size is lost. The shm size is >20G, > > it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G > > of address space of 3:1 like it doesn't fit in 2:2. > > Andrea, one thing I don't think we have discussed before is that aside > from mapping into shmfs or hugetlbfs, there is also the regular shmem > segment (shmget) we always use. the way we currently allocate memory is > like this : > > just a big shmem segment w/ shmget() up to like 1.7 or 2.5 gb, > containing the entire in memory part > > or > > shm (reasoanble sized segment, between 400mb and today on 32bit up to > like 1.7 - 2 gb) which is used for non buffercache (sqlcache, parse > trees etc) > a default of about 16000 mmaps into the shmfs file (or > remap_file_pages) and the total size ranging from a few gb to many gb > which contains the data buffer cache > > we cannot put the sqlcache(shared pool) into shmfs and do the windowing > and this is a big deal for performance as well. eg the larger the > better. 
it would have to be able to get to a reasonable size, and you > have about 512mb on top of that for the window into shmfs. average sizes > range between 1gb and 1.7gb so a 2/2 split would not be useful here. > sql/plsql/java cache is quite important for certain things. I see, so losing 1g sounds too much. > > I think Van is running a test on a 32gb box to compare the two but I think > that would be too limiting in general to have only 2gb. thanks for giving it a spin (btw I assume it's 2.4, that's fine for a quick test, and I seem not to find the 2:2 and 1:3 options in the 2.6 kernel anymore ;). What I probably didn't specify yet is that 2.5:1.5 is feasible too, I've a fairly small and straightforward patch here from ibm that implements 3.5:0.5 for PAE mode (for a completely different matter, but I mean, it's not really a problem to do 2.5:1.5 either if needed, it's the same as the current PAE mode 3.5:0.5). starting with the assumption that 32G machines work with 3:1 (like they do in 2.4), and assuming the size of a page_t is 48 bytes (like in 2.4; in 2.6 it's a bit bigger but we can most certainly shrink it, for example removing rmap for anon pages will immediately release 128M of kernel memory), moving from 32G to 64G means losing 384M of those additional 512M in pages, you can use the remaining additional 512M-384M=128M for vmas, task structs, files etc... So 2.5:1.5 should be enough as far as the kernel is concerned to run on 64G machines (provided the page_t is not bigger than in 2.4, which sounds feasible too). we can add a config option to enable together with 2.5:1.5 to drop the gap page in vmalloc, and to reduce the vmalloc space, so that we can sneak another few "free" dozen megs back for the 64G kernel just to get more margin even if we don't strictly need it. (btw, the vmalloc space is also tunable at boot, so this config option would just change the default value) So as far as 32G works with 3:1, 2.5:1.5 is going to be more than enough to handle 64G.
the question that remains is whether you can live with only 2.5G of address space, i.e. whether losing 512m is a blocker or not. I see losing 1G was way too much, but the kernel doesn't need 1G more, an additional 512m is enough to make the kernel happy. If losing 512m is a big problem too, I don't think we can drop less than 512m of address space from userspace, so 4:4 would remain the only way to handle 64G and we can forget about this suggestion of mine. Certainly I believe 2.5:1.5 has a very good chance to significantly outperform 4:4 if you can make the ipc shm 1.7G and the window on the shm 512m (that leaves you 300m for holes, stack, binary, anonymous memory and similar minor allocations). thanks. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 6:18 ` Andrea Arcangeli @ 2004-02-28 6:45 ` Martin J. Bligh 2004-02-28 7:05 ` Andrea Arcangeli [not found] ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel> 1 sibling, 1 reply; 100+ messages in thread From: Martin J. Bligh @ 2004-02-28 6:45 UTC (permalink / raw) To: Andrea Arcangeli, Wim Coekaerts, Hugh Dickins Cc: Andrew Morton, Rik van Riel, linux-kernel > thanks for giving it a spin (btw I assume it's 2.4, that's fine for > a quick test, and I seem not to find the 2:2 and 1:3 options in the 2.6 > kernel anymore ;). ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/212-config_page_offset (which sits on top of the 4/4 patches, so might need some massaging to apply) > What I probably didn't specify yet is that 2.5:1.5 is feasible too, I've > a fairly small and strightforward patch here from ibm that implements > 3.5:0.5 for PAE mode (for a completely different matter, but I mean, > it's not really a problem to do 2.5:1.5 either if needed, it's the same > as the current PAE mode 3.5:0.5). I'm not sure it's that straightforward really - doing the non-pgd aligned split is messy. 2.5 might actually be much cleaner than 3.5 though, as we never updated the mappings of the PMD that's shared between user and kernel. Hmmm ... that's quite tempting. > starting with the assumtion that 32G machines works with 3:1 (like they > do in 2.4), and assuming the size of a page is 48 bytes (like in 2.4, in > 2.6 it's a bit bigger but we can most certainly shrink it, for example > removing rmap for anon pages will immediatly release 128M of kernel > memory), moving from 32G to 64G means losing 384M of those additional > 512M in pages, you can use the remaining additional 512M-384M=128M for > vmas, task structs, files etc... 
So 2.5:1.5 should be enough as far as > the kernel is concerned to run on 64G machines (provided the page_t is > not bigger than 2.4 which sounds feasible too). Shrinking struct page sounds nice. Did Hugh's patch actually end up doing that? I don't recall that, but I don't see why it wouldn't. M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 6:45 ` Martin J. Bligh @ 2004-02-28 7:05 ` Andrea Arcangeli 2004-02-28 9:19 ` Dave Hansen 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-28 7:05 UTC (permalink / raw) To: Martin J. Bligh Cc: Wim Coekaerts, Hugh Dickins, Andrew Morton, Rik van Riel, linux-kernel On Fri, Feb 27, 2004 at 10:45:21PM -0800, Martin J. Bligh wrote: > > > thanks for giving it a spin (btw I assume it's 2.4, that's fine for > > a quick test, and I seem not to find the 2:2 and 1:3 options in the 2.6 > > kernel anymore ;). > > ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/212-config_page_offset > (which sits on top of the 4/4 patches, so might need some massaging to apply) thanks for maintaining this bit too, very helpful! > > What I probably didn't specify yet is that 2.5:1.5 is feasible too, I've > > a fairly small and straightforward patch here from ibm that implements > > 3.5:0.5 for PAE mode (for a completely different matter, but I mean, > > it's not really a problem to do 2.5:1.5 either if needed, it's the same > > as the current PAE mode 3.5:0.5). > > I'm not sure it's that straightforward really - doing the non-pgd aligned > split is messy. 2.5 might actually be much cleaner than 3.5 though, as we > never updated the mappings of the PMD that's shared between user and kernel. > Hmmm ... that's quite tempting. I read the 3.5:0.5 PAE patch sometime last year and it was pretty straightforward too, the only reason I didn't merge it is that it had the problem that it changed common code that every arch depends on, so it broke all other archs, but it's not really a matter of difficult code, at worst it just needs a few-line change in every arch to make them compile again. So I'm quite optimistic 2.5:1.5 will be doable with a reasonably clean patch and with ~zero performance downside compared to 3:1 and 2:2.
In the meantime testing 2:2 against 4:4 (even with a greatly reduced ipc shm in the 2:2 test) still sounds very interesting. > > starting with the assumption that 32G machines work with 3:1 (like they > > do in 2.4), and assuming the size of a page_t is 48 bytes (like in 2.4, in > > 2.6 it's a bit bigger but we can most certainly shrink it, for example > > removing rmap for anon pages will immediately release 128M of kernel > > memory), moving from 32G to 64G means losing 384M of those additional > > 512M in pages, you can use the remaining additional 512M-384M=128M for > > vmas, task structs, files etc... So 2.5:1.5 should be enough as far as > > the kernel is concerned to run on 64G machines (provided the page_t is > > not bigger than in 2.4 which sounds feasible too). > > Shrinking struct page sounds nice. Did Hugh's patch actually end up doing > that? I don't recall that, but I don't see why it wouldn't. full objrmap can certainly release 8 bytes per page, 128M total, so quite a huge amount of ram (that is also why I'd like to do the full objrmap and not only stop at the file mappings ;). ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 7:05 ` Andrea Arcangeli @ 2004-02-28 9:19 ` Dave Hansen 2004-03-18 2:44 ` Andrea Arcangeli 0 siblings, 1 reply; 100+ messages in thread From: Dave Hansen @ 2004-02-28 9:19 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Wim Coekaerts, Hugh Dickins, Andrew Morton, Rik van Riel, Linux Kernel Mailing List On Fri, 2004-02-27 at 23:05, Andrea Arcangeli wrote: > > I'm not sure it's that straightforward really - doing the non-pgd aligned > > split is messy. 2.5 might actually be much cleaner than 3.5 though, as we > > never updated the mappings of the PMD that's shared between user and kernel. > > Hmmm ... that's quite tempting. > > I read the 3.5:0.5 PAE sometime last year and it was pretty > strightforward too, the only single reason I didn't merge it is that > it had the problem that it changed common code that every archs depends > on, so it broke all other archs, but it's not really a matter of > difficult code, as worse it just needs a few liner change in every arch > to make them compile again. So I'm quite optimistic 2.5:1.5 will be > doable with a reasonably clean patch and with ~zero performance downside > compared to 3:1 and 2:2. The only performance problem with using PMDs which are shared between kernel and user PTE pages is that you have a potential to be required to instantiate the kernel portion of the shared PMD each time you need a new set of page tables. A slab for these partial PMDs is quite helpful in this case. The real logistical problem with partial PMDs is just making sure that all of the 0 ... PTRS_PER_PMD loops are correct. The last few times I've implemented it, I just made PTRS_PER_PMD take a PGD index, and made sure to start all of the loops from things like pmd_index(PAGE_OFFSET) instead of 0. Here are a couple of patches that allowed partial user/kernel PMDs. 
These conflicted with 4:4 and got dropped somewhere along the way, but the generic approaches worked. I believe they at least compiled on all of the arches, too. ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/540-separate_pmd ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/650-banana_split -- dave ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 9:19 ` Dave Hansen @ 2004-03-18 2:44 ` Andrea Arcangeli 0 siblings, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-18 2:44 UTC (permalink / raw) To: Dave Hansen Cc: Martin J. Bligh, Wim Coekaerts, Hugh Dickins, Andrew Morton, Rik van Riel, Linux Kernel Mailing List On Sat, Feb 28, 2004 at 01:19:01AM -0800, Dave Hansen wrote: > On Fri, 2004-02-27 at 23:05, Andrea Arcangeli wrote: > > > I'm not sure it's that straightforward really - doing the non-pgd aligned > > > split is messy. 2.5 might actually be much cleaner than 3.5 though, as we > > > never updated the mappings of the PMD that's shared between user and kernel. > > > Hmmm ... that's quite tempting. > > > > I read the 3.5:0.5 PAE patch sometime last year and it was pretty > > straightforward too, the only reason I didn't merge it is that > > it had the problem that it changed common code that every arch depends > > on, so it broke all other archs, but it's not really a matter of > > difficult code, at worst it just needs a few-line change in every arch > > to make them compile again. So I'm quite optimistic 2.5:1.5 will be > > doable with a reasonably clean patch and with ~zero performance downside > > compared to 3:1 and 2:2. > > The only performance problem with using PMDs which are shared between > kernel and user PTE pages is that you have a potential to be required to > instantiate the kernel portion of the shared PMD each time you need a > new set of page tables. A slab for these partial PMDs is quite helpful > in this case. that's a bigger cost during context switch but it's still zero cost for the syscalls, and it never flushes away the user address space unnecessarily. So I doubt it's measurable (unlike 4:4 which is a big hit). > The real logistical problem with partial PMDs is just making sure that > all of the 0 ... PTRS_PER_PMD loops are correct.
The last few times > I've implemented it, I just made PTRS_PER_PMD take a PGD index, and made > sure to start all of the loops from things like pmd_index(PAGE_OFFSET) > instead of 0. it is indeed tricky, though your last patch for 3.5G on PAE looked fine. But now I would like to include the 2.5:1.5 not 3.5:0.5 ;), maybe we can support 3.5:0.5 too at the same time (though 3.5:0.5 is secondary). > Here are a couple of patches that allowed partial user/kernel PMDs. > These conflicted with 4:4 and got dropped somewhere along the way, but > the generic approaches worked. I believe they at least compiled on all > of the arches, too. > > ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/540-separate_pmd > ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.5.68/2.5.68-mjb1/650-banana_split would you be willing to implement config 15GB too (2.5:1.5)? In the next days I'm going to work on the rbtree for the objrmap (just in case somebody wants to swap the shm with vlm instead of mlocking it), but I would like to get this done too ;). you see my current tree in 2.6.5-rc1-aa1 on the ftp site, but you can use any other kernel too since the code you will touch should be the same for all 2.6. It's up to you, only if you are interested, thanks. ^ permalink raw reply [flat|nested] 100+ messages in thread
[parent not found: <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel>]
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) [not found] ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel> @ 2004-02-28 12:46 ` Andi Kleen 2004-02-29 1:39 ` Andrea Arcangeli 0 siblings, 1 reply; 100+ messages in thread From: Andi Kleen @ 2004-02-28 12:46 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel Andrea Arcangeli <andrea@suse.de> writes: > > we can add a config option to enable together with 2.5:1.5 to drop the > gap page in vmalloc, and to reduce the vmalloc space, so that we can > sneak another few "free" dozen megs back for the 64G kernel just to get > more margin even if we don't strictly need it. (btw, the vmalloc space > is also tunable at boot, so this config option would just change the > default value) Not sure if that would help, but you could relatively easily save 8 bytes on 32bit for each vma too. Replace vm_next with rb_next() and move vm_rb.color into vm_flags. It would be a lot of editing work though. NUMA API will add new 4 bytes again. -Andi ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 12:46 ` Andi Kleen @ 2004-02-29 1:39 ` Andrea Arcangeli 2004-02-29 2:29 ` Andi Kleen 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-29 1:39 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel On Sat, Feb 28, 2004 at 01:46:47PM +0100, Andi Kleen wrote: > Andrea Arcangeli <andrea@suse.de> writes: > > > > we can add a config option to enable together with 2.5:1.5 to drop the > > gap page in vmalloc, and to reduce the vmalloc space, so that we can > > sneak another few "free" dozen megs back for the 64G kernel just to get > > more margin even if we don't strictly need it. (btw, the vmalloc space > > is also tunable at boot, so this config option would just change the > > default value) > > Not sure if that would help, but you could relatively easily save > 8 bytes on 32bit for each vma too. Replace vm_next with rb_next() > and move vm_rb.color into vm_flags. It would be a lot of editing the vm_flags rb_color thing is a smart idea indeed, I never thought about using vm_flags itself for it, however it clearly needs a generic wrapper since we want to keep the rbtree completely generic. David Woodhouse once suggested I use the least significant bit of one of the pointers to save the rb_color, that could work but that really messes the code up since such a pointer would need to be masked every time, and it's not self contained. Using vm_flags sounds more interesting since the pointers are still usable in raw mode, one only needs to be careful about the locking: vm_flags seems pretty much a readonly thing so it's probably ok, but if there were other writers outside the rbtree code then we'd need to make sure they're serialized. you're wrong about s/vm_next/rb_next()/, walking the tree like in get_unmapped_area would require recursive algos w/o vm_next, or significant heap allocations. that's the only thing vm_next is needed for (i.e.
to walk the tree in order efficiently). only if we drop all the tree walks can we nuke vm_next. > work though. NUMA API will add new 4 bytes again. saving memory in vmas is already partly accomplished by remap_file_pages, so I don't rate vma size as critical. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-29 1:39 ` Andrea Arcangeli @ 2004-02-29 2:29 ` Andi Kleen 2004-02-29 16:34 ` Andrea Arcangeli 0 siblings, 1 reply; 100+ messages in thread From: Andi Kleen @ 2004-02-29 2:29 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel On Sun, 29 Feb 2004 02:39:24 +0100 Andrea Arcangeli <andrea@suse.de> wrote: > you're wrong about s/vm_next/rb_next()/, walking the tree like in > get_unmapped_area would require recurisve algos w/o vm_next, or > significant heap allocations. that's the only thing vm_next is needed > for (i.e. to walk the tree in order efficiently). only if we drop all > tree walks than we can nuke vm_next. Not sure what you mean here. rb_next() is not recursive. -Andi ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-29 2:29 ` Andi Kleen @ 2004-02-29 16:34 ` Andrea Arcangeli 0 siblings, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-29 16:34 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel On Sun, Feb 29, 2004 at 03:29:47AM +0100, Andi Kleen wrote: > On Sun, 29 Feb 2004 02:39:24 +0100 > Andrea Arcangeli <andrea@suse.de> wrote: > > > you're wrong about s/vm_next/rb_next()/, walking the tree like in > > get_unmapped_area would require recursive algos w/o vm_next, or > > significant heap allocations. that's the only thing vm_next is needed > > for (i.e. to walk the tree in order efficiently). only if we drop all > > tree walks can we nuke vm_next. > > Not sure what you mean here. rb_next() is not recursive. if you don't allocate the memory with recursion-like algos, you'll burn too much cpu in a loop like this with rb_next. so it's worth keeping vm_next for performance reasons or for memory allocation reasons. for (vma = find_vma(mm, addr); ; vma = vma->vm_next) { ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 2:32 ` Andrea Arcangeli 2004-02-28 4:57 ` Wim Coekaerts @ 2004-02-28 6:10 ` Martin J. Bligh 2004-02-28 6:43 ` Andrea Arcangeli 2004-03-02 9:10 ` Kurt Garloff 1 sibling, 2 replies; 100+ messages in thread From: Martin J. Bligh @ 2004-02-28 6:10 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Rik van Riel, linux-kernel >> > 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is that >> > they can map 2.7G per task of shm instead of 1.7G per task of shm. > > On Fri, Feb 27, 2004 at 02:03:07PM -0800, Martin J. Bligh wrote: >> >> Eh? You have a 2GB difference of user address space, and a 1GB difference >> of shm size. You lost a GB somewhere ;-) Depending on whether you move >> TASK_UNMAPPPED_BASE or not, it you might mean 2.7 vs 0.7 or at a pinch >> 3.5 vs 1.5, I'm not sure. > > the numbers I wrote are right. No shm size is lost. The shm size is >20G, > it doesn't fit in 4g of address space of 4:4 like it doesn't fit in 3G > of address space of 3:1 like it doesn't fit in 2:2. OK, I understand you can window it, but I still don't get where your figures of 2.7GB/task vs 1.7GB per task come from? > I think nobody tested 2:2 seriously on 64G boxes yet, I'm simply asking > for that. > > And I agree with you using 64G with 3:1 is not feasible for application > like databases, it's feasible for other apps for example needing big > caches (if you can manage to boot the machine ;) it's not a matter of > opinion, it's a matter fact, for a generic misc load the high limit of > 3:1 is mem=48G, which is not too bad. 48GB is sailing damned close to the wind. The problem I've had before is distros saying "we support X GB of RAM", but it only works for some workloads, and falls over on others. Oddly enough, that tends to upset the customers quite a bit ;-) I'd agree with what you say - for a generic misc load, it might work ... 
but I'd hate a customer to hear that and misinterpret it. > What changes between 3:1 and 2:2 is the "view" on the 20G shm file, not > the size of the shm. you can do less simultaneous mmap with a 1.7G view > instead of a 2.7G view. the nonlinear vma will be 1.7G in size with 2:2, > instead of 2.7G in size with 3:1 or 4:4 (300M are as usual left for some > hole, the binary itself and the stack) Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on 4:4 ??? > the only chance it's faster is if you never use syscalls and you drive > all interrupts to other cpus and you have an advantage by mapping >2G in > the same address space. I think that's the key - when you need to map a LOT of data into the address space. Unfortunately, I think that's the kind of app that the large machines run. > I've some doubt 4:4 runs faster anywhere. I could be wrong though. There's only one real way to tell ;-) >> If you send me a *simple* simulation test, I'll gladly run it for you ;-) >> But I'm not going to go fiddle with Oracle, and thousands of disks ;-) > > :) > > thanks for the offer! ;) I would prefer a real life db bench since > syscalls and irqs are an important part of the load that hurts 4:4 most, > it doesn't need to be necessairly oracle though. And if it's a cpu with > big tlb cache like p4 it would be prefereable. maybe we should talk > about this offline. I've been talking with others here about running a database workload test, but it'll probably be on a machine with only 8GB or so. I still think that's enough to show us something interesting. > agreed. It's just lower prio at the moment since anon memory doesn't > tend to be that much shared, so the overhead is minimal. Yup, that's what my analysis found, most of it falls under the pte_direct optimisation. The only problem seems to be that at fork/exec time we set up the chain, then tear it down again, which is ugly. That's the bit where I like Hugh's stuff. 
>> I don't have time at the moment to go write it at the moment, but I >> can certainly run it on large end hardware if that helps. > > thanks, we should write it someday. that testcase isn't the one suitable > for the 4:4 vs 2:2 thing though, for that a real life thing is needed > since irqs, syscalls (and possibly page faults but not that many with a > db) are fundamental parts of the load. we could write a smarter > testcase as well, but I guess using a db is simpler, evaluating 2:2 vs > 4:4 is more a do-once thing, results won't change over time. OK, I'll see what people here can do about that ;-) M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 6:10 ` Martin J. Bligh @ 2004-02-28 6:43 ` Andrea Arcangeli 2004-02-28 7:00 ` Martin J. Bligh 2004-03-02 9:10 ` Kurt Garloff 1 sibling, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-28 6:43 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, linux-kernel On Fri, Feb 27, 2004 at 10:10:22PM -0800, Martin J. Bligh wrote: > OK, I understand you can window it, but I still don't get where your > figures of 2.7GB/task vs 1.7GB per task come from? 2.7G is what you can map right now in 2.6 with 3:1. dropping 1G from the userspace means reducing the shm mappings to 1.7G of address space. sorry if I caused some confusion. > 48GB is sailing damned close to the wind. The problem I've had before is ;) > distros saying "we support X GB of RAM", but it only works for some > workloads, and falls over on others. Oddly enough, that tends to upset > the customers quite a bit ;-) I'd agree with what you say - for a generic > misc load, it might work ... but I'd hate a customer to hear that and > misinterpret it. I see... > > What changes between 3:1 and 2:2 is the "view" on the 20G shm file, not > > the size of the shm. you can do less simultaneous mmap with a 1.7G view > > instead of a 2.7G view. the nonlinear vma will be 1.7G in size with 2:2, > > instead of 2.7G in size with 3:1 or 4:4 (300M are as usual left for some > > hole, the binary itself and the stack) > > Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on > 4:4 ??? yes it can be bigger there. I wrote it to simplify, I mean it doesn't need to be bigger, but it can. > > the only chance it's faster is if you never use syscalls and you drive > > all interrupts to other cpus and you have an advantage by mapping >2G in > > the same address space. > > I think that's the key - when you need to map a LOT of data into the > address space. 
Unfortunately, I think that's the kind of app that the > large machines run. agreed. > > I've some doubt 4:4 runs faster anywhere. I could be wrong though. > > There's only one real way to tell ;-) indeed ;) > >> If you send me a *simple* simulation test, I'll gladly run it for you ;-) > >> But I'm not going to go fiddle with Oracle, and thousands of disks ;-) > > > > :) > > > > thanks for the offer! ;) I would prefer a real life db bench since > > syscalls and irqs are an important part of the load that hurts 4:4 most, > > it doesn't need to necessarily be oracle though. And if it's a cpu with > > a big tlb cache like p4 it would be preferable. maybe we should talk > > about this offline. > > I've been talking with others here about running a database workload > test, but it'll probably be on a machine with only 8GB or so. I still > think that's enough to show us something interesting. yes, it should be enough to show something interesting. However the best would be to really run it on a 32G box, 32G should really show the divergence. getting results w/ and w/o hugetlbfs may be interesting too (it's not clear if 4:4 will benefit more or less from hugetlbfs, it will walk only twice to reach the physical page, but OTOH flushing the tlb so frequently will partly invalidate the huge tlb behaviour). > > agreed. It's just lower prio at the moment since anon memory doesn't > > tend to be that much shared, so the overhead is minimal. > > Yup, that's what my analysis found, most of it falls under the pte_direct > optimisation. The only problem seems to be that at fork/exec time we > set up the chain, then tear it down again, which is ugly. That's the bit > where I like Hugh's stuff. Me too. I've a testcase here that runs 50% slower in 2.6 than 2.4, due to the slowdown in fork/pagefaults etc.. (real apps of course don't show it, this is a "malicious" testcase ;). 
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define SIZE (1024*1024*1024)

int main(int argc, char ** argv)
{
	int fd, level, max_level;
	char * start, * end, * tmp;

	max_level = atoi(argv[1]);

	fd = open("/tmp/x", O_CREAT|O_RDWR, 0600);
	if (fd < 0)
		perror("open"), exit(1);
	if (ftruncate(fd, SIZE) < 0)
		perror("truncate"), exit(1);
	if ((start = mmap(0, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0)) == MAP_FAILED)
		perror("mmap"), exit(1);
	end = start + SIZE;

	for (tmp = start; tmp < end; tmp += 4096)
		*tmp = 0;

	for (level = 0; level < max_level; level++) {
		if (fork() < 0)
			perror("fork"), exit(1);
		if (munmap(start, SIZE) < 0)
			perror("munmap"), exit(1);
		if ((start = mmap(0, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0)) == MAP_FAILED)
			perror("mmap"), exit(1);
		end = start + SIZE;

		for (tmp = start; tmp < end; tmp += 4096)
			*(volatile char *)tmp;
	}

	return 0;
}

(it's insecure since "/tmp/x" is fixed, change that file if you need local security). > >> I don't have time at the moment to go write it at the moment, but I > >> can certainly run it on large end hardware if that helps. > > > > thanks, we should write it someday. that testcase isn't the one suitable > > for the 4:4 vs 2:2 thing though, for that a real life thing is needed > > since irqs, syscalls (and possibly page faults but not that many with a > > db) are fundamental parts of the load. we could write a smarter > > testcase as well, but I guess using a db is simpler, evaluating 2:2 vs > > 4:4 is more a do-once thing, results won't change over time. > > OK, I'll see what people here can do about that ;-) cool ;) as I wrote to Wim to make it more acceptable we'll have to modify your 3.5:0.5 PAE patch to do 2.5:1.5 too, to give userspace another 512m that the kernel actually doesn't need. 
And still I'm not sure if Wim can live with 1.7G ipcshm and 512m of shmfs window, if that's not enough user address space then it's unlikely this thread will go anywhere since 512m are needed to handle an additional 32G with a reasonable margin (even after shrinking the page_t to the 2.4 levels). The last issue that we may run into is apps assuming the stack is at 3G fixed, some jvm assumed that, but they should be fixed by now (at the very least it's not hard at all to fix those). It also depends on the performance difference whether this is worthwhile, if the difference isn't very significant 4:4 will certainly be preferable so you can also allocate 4G in the same task for apps not using syscalls or page faults or floods of network irqs. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 6:43 ` Andrea Arcangeli @ 2004-02-28 7:00 ` Martin J. Bligh 2004-02-28 7:29 ` Andrea Arcangeli 0 siblings, 1 reply; 100+ messages in thread From: Martin J. Bligh @ 2004-02-28 7:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Rik van Riel, linux-kernel > The last issue that we may run into is apps assuming the stack is at 3G > fixed, some jvm assumed that, but they should be fixed by now (at the > very least it's not hard at all to fix those). All the potential solutions we're discussing hit that problem so I don't see it matters much which one we choose ;-) > It also depends on the performance difference whether this is worthwhile, if > the difference isn't very significant 4:4 will certainly be preferable > so you can also allocate 4G in the same task for apps not using syscalls > or page faults or floods of network irqs. There are some things that may well help here: one is vsyscall gettimeofday, which will fix up the worst of the issues (the 30% figure you mentioned to me in Ottawa), the other is NAPI, which would help with the network stuff. Bill had a patch to allocate mmaps, etc down from the top of memory and thus eliminate TASK_UNMAPPED_BASE, and shift the stack back into the empty hole from 0-128MB of memory where it belongs (according to the spec). Getting rid of those two problems gives us back a little more userspace as well. Unfortunately it does seem to break some userspace apps making stupid assumptions, but if we have a neat way to mark the binaries (Andi was talking about personalities or something), we could at least get the big mem hogs to do that (databases, java, etc). I have a copy of Bill's patch in my tree if you want to take a look: ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/410-topdown That might make your 2.5/1.5 proposal more feasible with less loss of userspace. 
M ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 7:00 ` Martin J. Bligh @ 2004-02-28 7:29 ` Andrea Arcangeli 2004-02-28 14:55 ` Rik van Riel 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-28 7:29 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Rik van Riel, linux-kernel On Fri, Feb 27, 2004 at 11:00:44PM -0800, Martin J. Bligh wrote: > > The last issue that we may run into is apps assuming the stack is at 3G > > fixed, some jvm assumed that, but they should be fixed by now (at the > > very least it's not hard at all to fix those). > > All the potential solutions we're discussing hit that problem so I don't > see it matters much which one we choose ;-) I agree, I thought about it too, but I didn't mention that since theoretically 4:4 has a chance to start with a stack at 3G and to depend on the userspace startup to relocate it at 4G ;). x86-64 does something like that to guarantee 100% compatibility. > > It also depends on the performance difference whether this is worthwhile, if > > the difference isn't very significant 4:4 will certainly be preferable > > so you can also allocate 4G in the same task for apps not using syscalls > > or page faults or floods of network irqs. > > There are some things that may well help here: one is vsyscall gettimeofday, > which will fix up the worst of the issues (the 30% figure you mentioned > to me in Ottawa), the other is NAPI, which would help with the network > stuff. I think it's very fair to benchmark vsyscalls with 2.5:1.5 vs vsyscalls with 4:4. However, remember you said you want a generic kernel for 64G, right? Not all userspaces will use vsyscalls, and it's not just one app using gettimeofday. As of today no production userspace uses vgettimeofday on x86 yet. I mean, we can tell people to always use vsyscalls with the 4:4 kernel and it's acceptable, but it's not as generic as 2.5:1.5. 
> Bill had a patch to allocate mmaps, etc down from the top of memory and > thus eliminate TASK_UNMAPPED_BASE, and shift the stack back into the > empty hole from 0-128MB of memory where it belongs (according to the spec). > Getting rid of those two problems gives us back a little more userspace > as well. > > Unfortunately it does seem to break some userspace apps making stupid > assumptions, but if we have a neat way to mark the binaries (Andi was > talking about personalities or something), we could at least get the > big mem hogs to do that (databases, java, etc). I read something about this issue. I agree it must definitely be marked. apps may very well make assumptions about that space being empty below 128m and overwrite it with a mmap() (mmap will just silently overwrite), and I'm unsure if we can claim that to be a userspace bug..., I guess most people will blame the kernel ;) Now that x86 is dying it's probably not worth marking the binaries, the few apps needing this should relocate the stack by hand and setup the growsdown bitflag. plus they should lower mapped base by hand with the /proc tweak like we do in 2.4. I agree having the stack growsdown at 128 is the best for the db setup, but I doubt we can make it generic and automatic for all apps. Also it's not really the stack that's the problem in terms of genericity, in fact with recursive algos the stack may need to grow a lot, and having it at 128m could segfault. As for mapped-base, the space between 128m and 1G may as well be assumed empty by the apps, so relocation is possible on demand by the app. I doubt we can do better than the above without taking risks ;) > I have a copy of Bill's patch in my tree if you want to take a look: > > ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/patches/2.6.3/2.6.3-mjb1/410-topdown thanks for the pointer. > > That might make your 2.5/1.5 proposal more feasible with less loss of > userspace. Yes. 
I was sort of assuming that we would use the mapped-base tweak for achieving that, the relocation of the stack is a good idea, and it's doable all in userspace (though it's not generic/automatic). thanks. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 7:29 ` Andrea Arcangeli @ 2004-02-28 14:55 ` Rik van Riel 2004-02-28 15:06 ` Arjan van de Ven 2004-02-29 1:43 ` Andrea Arcangeli 0 siblings, 2 replies; 100+ messages in thread From: Rik van Riel @ 2004-02-28 14:55 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel On Sat, 28 Feb 2004, Andrea Arcangeli wrote: > I agree, I thought about it too, but I didn't mention that since > theoretically 4:4 has a chance to start with a stack at 3G and to depend > on the userspace startup to relocate it at 4G ;). x86-64 does something > like that to guarantee 100% compatibility. Personalities work fine for that kind of thing. The few buggy apps that can't deal with addresses >3GB (IIRC the JVM in the Oracle installer) get their stack at 3GB, the others get their stack at 4GB. > I think it's very fair to benchmark vsyscalls with 2.5:1.5 vs vsyscalls > with 4:4. The different setups should definitely be benchmarked. I know we expected the 4:4 kernel to be slower at everything, but the folks at Oracle actually ran into a few situations where the 4:4 kernel was _faster_ than a 3:1 kernel. Definitely not what we expected, but a nice surprise nonetheless. > Now that x86 is dying it's probably not worth marking the binaries, All you need to do for that is to copy some code from RHEL3 ;) > I agree having the stack growsdown at 128 is the best for the db setup, Alternatively, you start the mmap at "stack start - stack ulimit" and grow it down from there. That still gives you 3.8GB of usable address space on x86, with the 4:4 split. ;) cheers, Rik -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 14:55 ` Rik van Riel @ 2004-02-28 15:06 ` Arjan van de Ven 2004-02-29 1:43 ` Andrea Arcangeli 1 sibling, 0 replies; 100+ messages in thread From: Arjan van de Ven @ 2004-02-28 15:06 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 197 bytes --] > > Now that x86 is dying it's probably not worth marking the binaries, > > All you need to do for that is to copy some code from RHEL3 ;) which we in turn copied from Andi, e.g. x86-64 [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 14:55 ` Rik van Riel 2004-02-28 15:06 ` Arjan van de Ven @ 2004-02-29 1:43 ` Andrea Arcangeli [not found] ` <1078370073.3403.759.camel@abyss.local> 2004-03-04 3:14 ` Peter Zaitsev 1 sibling, 2 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-29 1:43 UTC (permalink / raw) To: Rik van Riel; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel On Sat, Feb 28, 2004 at 09:55:14AM -0500, Rik van Riel wrote: > The different setups should definitely be benchmarked. I know > we expected the 4:4 kernel to be slower at everything, but the > folks at Oracle actually ran into a few situations where the 4:4 > kernel was _faster_ than a 3:1 kernel. > > Definitely not what we expected, but a nice surprise nonetheless. this is the first time I hear something like this. Maybe you mean the 4:4 was actually using more ram for the SGA? Just curious. ^ permalink raw reply [flat|nested] 100+ messages in thread
[parent not found: <1078370073.3403.759.camel@abyss.local>]
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-29 1:43 ` Andrea Arcangeli [not found] ` <1078370073.3403.759.camel@abyss.local> @ 2004-03-04 3:14 ` Peter Zaitsev 2004-03-04 3:33 ` Andrew Morton 1 sibling, 1 reply; 100+ messages in thread From: Peter Zaitsev @ 2004-03-04 3:14 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Martin J. Bligh, Andrew Morton, linux-kernel On Sat, 2004-02-28 at 17:43, Andrea Arcangeli wrote: > > > > Definitely not what we expected, but a nice surprise nonetheless. > > this is the first time I hear something like this. Maybe you mean the > 4:4 was actually using more ram for the SGA? Just curious. I actually recently did MySQL benchmarks using the DBT2 MySQL port. The test box was a 4-way Xeon w/ HT, 4Gb RAM, 8 SATA disks in RAID10. I used RH AS 3.0 for the tests (2.4.21-9.ELxxx) For disk bound workloads (200 Warehouses) I got 1250TPM for "hugemem" vs 1450TPM for "smp" kernel, which is some 14% slowdown. For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM, which is over 35% slowdown. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL) http://www.mysql.com/uc2004/ ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 3:14 ` Peter Zaitsev @ 2004-03-04 3:33 ` Andrew Morton 2004-03-04 3:44 ` Peter Zaitsev 0 siblings, 1 reply; 100+ messages in thread From: Andrew Morton @ 2004-03-04 3:33 UTC (permalink / raw) To: Peter Zaitsev; +Cc: andrea, riel, mbligh, linux-kernel Peter Zaitsev <peter@mysql.com> wrote: > > On Sat, 2004-02-28 at 17:43, Andrea Arcangeli wrote: > > > > > > > Definately not what we expected, but a nice surprise nontheless. > > > > this is the first time I hear something like this. Maybe you mean the > > 4:4 was actually using more ram for the SGA? Just curious. > > I actually recently Did MySQL benchmarks using DBT2 MySQL port. > > The test box was 4Way Xeon w HT, 4Gb RAM, 8 SATA Disks in RAID10. > > I used RH AS 3.0 for tests (2.4.21-9.ELxxx) > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs > 1450TPM for "smp" kernel, which is some 14% slowdown. Please define these terms. What is the difference between "hugemem" and "smp"? > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM, > which is over 35% slowdown. Well no, it is a 56% speedup. Please clarify. Lots. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 3:33 ` Andrew Morton @ 2004-03-04 3:44 ` Peter Zaitsev 2004-03-04 4:07 ` Andrew Morton 2004-03-05 10:33 ` Ingo Molnar 0 siblings, 2 replies; 100+ messages in thread From: Peter Zaitsev @ 2004-03-04 3:44 UTC (permalink / raw) To: Andrew Morton; +Cc: andrea, riel, mbligh, linux-kernel On Wed, 2004-03-03 at 19:33, Andrew Morton wrote: > > > > For disk bound workloads (200 Warehouses) I got 1250TPM for "hugemem" vs > > 1450TPM for "smp" kernel, which is some 14% slowdown. > > Please define these terms. What is the difference between "hugemem" and > "smp"? Andrew, Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel naming. "SMP" corresponds to the normal SMP kernel they have, "hugemem" is the kernel with the 4G/4G split. > > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM, > > which is over 35% slowdown. > > Well no, it is a 56% speedup. Please clarify. Lots. Huh. The numbers should be the other way around of course :) The "smp" kernel had better performance of some 7000TPM, compared to 4500TPM with the HugeMem kernel. Swap was disabled in both cases. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL) http://www.mysql.com/uc2004/ ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 3:44 ` Peter Zaitsev @ 2004-03-04 4:07 ` Andrew Morton 2004-03-04 4:44 ` Peter Zaitsev ` (2 more replies) 2004-03-05 10:33 ` Ingo Molnar 1 sibling, 3 replies; 100+ messages in thread From: Andrew Morton @ 2004-03-04 4:07 UTC (permalink / raw) To: Peter Zaitsev; +Cc: andrea, riel, mbligh, linux-kernel Peter Zaitsev <peter@mysql.com> wrote: > > Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel > naming. "SMP" corresponds to the normal SMP kernel they have, "hugemem" > is the kernel with the 4G/4G split. > > > > > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM, > > > which is over 35% slowdown. > > > > Well no, it is a 56% speedup. Please clarify. Lots. > > Huh. The numbers should be the other way around of course :) The "smp" kernel > had better performance of some 7000TPM, compared to 4500TPM with > the HugeMem kernel. That's a larger difference than I expected. But then, everyone has been mysteriously quiet with the 4g/4g benchmarking. A kernel profile would be interesting. As would an optimisation effort, which, as far as I know, has never been undertaken. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 4:07 ` Andrew Morton @ 2004-03-04 4:44 ` Peter Zaitsev 2004-03-04 4:52 ` Andrea Arcangeli 2004-03-04 17:35 ` Martin J. Bligh 2 siblings, 0 replies; 100+ messages in thread From: Peter Zaitsev @ 2004-03-04 4:44 UTC (permalink / raw) To: Andrew Morton; +Cc: andrea, riel, mbligh, linux-kernel On Wed, 2004-03-03 at 20:07, Andrew Morton wrote: > > Huh. The numbers shall be other way around of course :) "smp" kernel > > had better performance of some 7000TPM, compared to 4500TPM with > > HugeMem kernel. > > That's a larger difference than I expected. But then, everyone has been > mysteriously quiet with the 4g/4g benchmarking. Yes. It is larger than I expected as well but numbers are pretty reliable. > > A kernel profile would be interesting. As would an optimisation effort, > which, as far as I know, has never been undertaken. Just let me know which information you would like me to gather and how and I'll get it for you. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL) http://www.mysql.com/uc2004/ ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 4:07 ` Andrew Morton 2004-03-04 4:44 ` Peter Zaitsev @ 2004-03-04 4:52 ` Andrea Arcangeli 2004-03-04 5:10 ` Andrew Morton ` (2 more replies) 2004-03-04 17:35 ` Martin J. Bligh 2 siblings, 3 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-04 4:52 UTC (permalink / raw) To: Andrew Morton; +Cc: Peter Zaitsev, riel, mbligh, linux-kernel On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote: > That's a larger difference than I expected. But then, everyone has been mysql is threaded (it's not using processes that force tlb flushes at every context switch), so the only time a tlb flush ever happens is when a syscall or an irq or a page fault happens with 4:4. No tlb flush would ever happen with 3:1 in the whole workload (yeah, some background tlb flushing happens anyways when you type a char in bash or move the mouse of course, but it's very low frequency) (to be fair, because it's threaded it means they also find the 512m of address space lost more problematic than the dbs using processes do, though besides the reduced address space there would be no measurable slowdown with 2.5:1.5) Also the 4:4 pretty much depends on the vgettimeofday to be backported from the x86-64 tree and a userspace to use it, so the test may be repeated with vgettimeofday, though it's very possible mysql isn't using gettimeofday as much as other databases, and especially for the I/O bound workload gettimeofday shouldn't matter that much. another reason could be the xeon bit, all numbers I've seen were on p3, that's why I was asking about xeon and p4 or more recent. all random ideas, just guessing. > mysteriously quiet with the 4g/4g benchmarking. indeed. > A kernel profile would be interesting. As would an optimisation effort, > which, as far as I know, has never been undertaken. 
yes, though I doubt you'll find anything interesting in the kernel, the slowdown should happen because the userspace runs slower, it's like underclocking the cpu, it's not a bottleneck in the kernel that can be optimized (at least unless there are bugs in the patch, which I don't think there are). ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 4:52 ` Andrea Arcangeli @ 2004-03-04 5:10 ` Andrew Morton 2004-03-04 5:27 ` Andrea Arcangeli 2004-03-05 20:19 ` Jamie Lokier 2004-03-04 12:12 ` Rik van Riel 2004-03-04 16:21 ` Peter Zaitsev 2 siblings, 2 replies; 100+ messages in thread From: Andrew Morton @ 2004-03-04 5:10 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: peter, riel, mbligh, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote: > > That's a larger difference than I expected. But then, everyone has been > > mysql is threaded There is a patch in -mm's 4g/4g implementation (4g4g-locked-userspace-copy.patch) which causes all kernel<->userspace copies to happen under page_table_lock. In some threaded apps on SMP this is likely to cause utterly foul performance. That's why I'm keeping it as a separate patch. The problem which it fixes is very obscure indeed and I suspect most implementors will simply drop it after they've had a two-second peek at the profile results. hm, I note that the changelog in that patch is junk. I'll fix that up. Something like: The current 4g/4g implementation does not guarantee the atomicity of mprotect() on SMP machines. If one CPU is in the middle of a read() into a user memory region and another CPU is in the middle of an mprotect(!PROT_READ) of that region, it is possible for a race to occur which will result in that read successfully completing _after_ the other CPU's mprotect() call has returned. We believe that this could cause misbehaviour of such things as the boehm garbage collector. This patch provides the mprotect() atomicity by performing all userspace copies under page_table_lock. It is a judgement call. Personally, I wouldn't ship a production kernel with this patch. People need to be aware of the tradeoff and to think and test very carefully. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 5:10 ` Andrew Morton @ 2004-03-04 5:27 ` Andrea Arcangeli 2004-03-04 5:38 ` Andrew Morton 1 sibling, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-04 5:27 UTC (permalink / raw) To: Andrew Morton; +Cc: peter, riel, mbligh, linux-kernel On Wed, Mar 03, 2004 at 09:10:42PM -0800, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote: > > > That's a larger difference than I expected. But then, everyone has been > > > > mysql is threaded > > There is a patch in -mm's 4g/4g implementation > (4g4g-locked-userspace-copy.patch) which causes all kernel<->userspace > copies to happen under page_table_lock. In some threaded apps on SMP this > is likely to cause utterly foul performance. I see, I wasn't aware of this issue with the copy-user code, thanks for the info. I definitely agree having a profiling of the run would be nice, since maybe part of the overhead is due to this lock (though I doubt it's most of the overhead), so we can see if it was that spinlock generating part of the slowdown. > That's why I'm keeping it as a separate patch. The problem which it fixes > is very obscure indeed and I suspect most implementors will simply drop it > after they've had a two-second peek at the profile results. I doubt one can ship without it without feeling a bit like cheating, the garbage collectors sometimes depend on mprotect to generate protection faults, it's not like nothing is using mprotect in racy ways against other threads. > It is a judgement call. Personally, I wouldn't ship a production kernel > with this patch. People need to be aware of the tradeoff and to think and > test very carefully. test what? there's no way to know what sort of proprietary software people will run on the thing. 
Personally I wouldn't feel safe shipping a kernel with a known race condition add-on. I mean, if you don't know about it and it's an implementation bug, you know nobody is perfect and you try to fix it if it happens, but if you know about it and you don't apply it, that's pretty bad if something goes wrong. Especially because it's a race, even if you test it, it may still happen only a long time later during production. I would never trade safety for performance, if anything I'd try to find a more complex way to serialize against the vmas or similar. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 5:27 ` Andrea Arcangeli @ 2004-03-04 5:38 ` Andrew Morton 0 siblings, 0 replies; 100+ messages in thread From: Andrew Morton @ 2004-03-04 5:38 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: peter, riel, mbligh, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > > It is a judgement call. Personally, I wouldn't ship a production kernel > > with this patch. People need to be aware of the tradeoff and to think and > > test very carefully. > > test what? there's no way to know what soft of proprietary software > people will run on the thing. In the vast majority of cases the application was already racy. It took davem a very long time to convince me that this was really a bug ;) > Personally I wouldn't feel safe to ship a kernel with a known race > condition add-on. I mean, if you don't know about it and it's an > implementation bug you know nobody is perfect and you try to fix it if > it happens, but if you know about it and you don't apply it, that's > pretty bad if something goes wrong. Especially because it's a race, > even you test it, it may still happen only a long time later during > production. I would never trade performance for safety, if something I'd > try to find a more complex way to serialize against the vmas or similar. Well first people need to understand the problem and convince themselves that this really is a bug. And yes, there are surely other ways of fixing it up. One might be to put some sequence counter in the mm_struct and rerun the mprotect if it detects that someone else snuck in with a usercopy. Or add an rwsem to the mm_struct, take it for writing in mprotect. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 5:10 ` Andrew Morton 2004-03-04 5:27 ` Andrea Arcangeli @ 2004-03-05 20:19 ` Jamie Lokier 2004-03-05 20:33 ` Andrea Arcangeli 1 sibling, 1 reply; 100+ messages in thread From: Jamie Lokier @ 2004-03-05 20:19 UTC (permalink / raw) To: Andrew Morton; +Cc: Andrea Arcangeli, peter, riel, mbligh, linux-kernel Andrew Morton wrote: > We believe that this could cause misbehaviour of such things as the > boehm garbage collector. This patch provides the mprotect() atomicity by > performing all userspace copies under page_table_lock. Can you use a read-write lock, so that userspace copies only need to take the lock for reading? That doesn't eliminate cacheline bouncing but does eliminate the serialisation. Or did you do that already, and found performance is still very low? > It is a judgement call. Personally, I wouldn't ship a production kernel > with this patch. People need to be aware of the tradeoff and to think and > test very carefully. If this isn't fixed, _please_ provide a way for a garbage collector to query the kernel as to whether this race condition is present. -- Jamie ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 20:19 ` Jamie Lokier @ 2004-03-05 20:33 ` Andrea Arcangeli 2004-03-05 21:44 ` Jamie Lokier 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 20:33 UTC (permalink / raw) To: Jamie Lokier; +Cc: Andrew Morton, peter, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 08:19:55PM +0000, Jamie Lokier wrote: > Andrew Morton wrote: > > We believe that this could cause misbehaviour of such things as the > > boehm garbage collector. This patch provides the mprotect() atomicity by > > performing all userspace copies under page_table_lock. > > Can you use a read-write lock, so that userspace copies only need to > take the lock for reading? That doesn't eliminate cacheline bouncing > but does eliminate the serialisation. normally the bouncing would be the only overhead, but here I also think the serialization is a significant factor of the contention because the critical section is taking lots of time. So I would expect some improvement by using a read/write lock. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 20:33 ` Andrea Arcangeli @ 2004-03-05 21:44 ` Jamie Lokier 0 siblings, 0 replies; 100+ messages in thread From: Jamie Lokier @ 2004-03-05 21:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, peter, riel, mbligh, linux-kernel Andrea Arcangeli wrote: > > Can you use a read-write lock, so that userspace copies only need to > > take the lock for reading? That doesn't eliminate cacheline bouncing > > but does eliminate the serialisation. > > normally the bouncing would be the only overhead, but here I also think > the serialization is a significant factor of the contention because the > critical section is taking lots of time. So I would expect some > improvement by using a read/write lock. For something as significant as user<->kernel data transfers, it might be worth eliminating the bouncing as well - by using per-CPU * per-mm spinlocks. User<->kernel data transfers would take the appropriate per-CPU lock for the current mm, and not take page_table_lock. Everything that normally takes page_table_lock would, and also take all of the per-CPU locks. That does require a set of per-CPU spinlocks to be allocated whenever a new mm is allocated (although the sets could be cached so it needn't be slow). -- Jamie ^ permalink raw reply [flat|nested] 100+ messages in thread
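Jamie's per-CPU * per-mm scheme is essentially "big-reader" locking: the fast path touches only a CPU-local lock, so its cacheline never bounces, while the rare writer pays by taking every slot. A hedged userspace sketch, with mutexes standing in for spinlocks and all names invented:

```c
#include <pthread.h>

/* Illustrative sketch of per-CPU * per-mm locks: a usercopy takes
 * only its own CPU's slot; the paths that normally take
 * page_table_lock take all slots, always in the same order. */
#define NR_CPUS 4

struct mm_percpu_lock {
    pthread_mutex_t cpu[NR_CPUS];
};

static void mm_percpu_lock_init(struct mm_percpu_lock *l)
{
    for (int i = 0; i < NR_CPUS; i++)
        pthread_mutex_init(&l->cpu[i], NULL);
}

/* fast path: user<->kernel copies lock only the current CPU's slot */
static void usercopy_lock(struct mm_percpu_lock *l, int cpu)
{
    pthread_mutex_lock(&l->cpu[cpu]);
}
static void usercopy_unlock(struct mm_percpu_lock *l, int cpu)
{
    pthread_mutex_unlock(&l->cpu[cpu]);
}

/* slow path: take every per-CPU slot, in a fixed order to avoid
 * deadlock against another all-slot taker */
static void ptl_lock_all(struct mm_percpu_lock *l)
{
    for (int i = 0; i < NR_CPUS; i++)
        pthread_mutex_lock(&l->cpu[i]);
}
static void ptl_unlock_all(struct mm_percpu_lock *l)
{
    for (int i = NR_CPUS - 1; i >= 0; i--)
        pthread_mutex_unlock(&l->cpu[i]);
}
```

The trade-off Jamie mentions is visible in the structure: one lock set must be allocated per mm, and the writer's cost grows with the CPU count.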
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 4:52 ` Andrea Arcangeli 2004-03-04 5:10 ` Andrew Morton @ 2004-03-04 12:12 ` Rik van Riel 2004-03-04 16:21 ` Peter Zaitsev 2 siblings, 0 replies; 100+ messages in thread From: Rik van Riel @ 2004-03-04 12:12 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, mbligh, linux-kernel On Thu, 4 Mar 2004, Andrea Arcangeli wrote: > On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote: > > A kernel profile would be interesting. As would an optimisation effort, > > which, as far as I know, has never been undertaken. > > yes, though I doubt you'll find anything interesting in the kernel, Oh, but there is a big bottleneck left, at least in RHEL3. All the CPUs use the _same_ mm_struct in kernel space, so all VM operations inside the kernel are effectively single threaded. Ingo had a patch to fix that, but it wasn't ready in time. Maybe it is in the 2.6 patch set, maybe not ... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 4:52 ` Andrea Arcangeli 2004-03-04 5:10 ` Andrew Morton 2004-03-04 12:12 ` Rik van Riel @ 2004-03-04 16:21 ` Peter Zaitsev 2004-03-04 18:13 ` Andrea Arcangeli 2 siblings, 1 reply; 100+ messages in thread From: Peter Zaitsev @ 2004-03-04 16:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, riel, mbligh, linux-kernel On Wed, 2004-03-03 at 20:52, Andrea Arcangeli wrote: Andrea, > mysql is threaded (it's not using processes that force tlb flushes at > every context switch), so the only time a tlb flush ever happens is when > a syscall or an irq or a page fault happens with 4:4. Not tlb flush > would ever happen with 3:1 in the whole workload (yeah, some background > tlb flushing happens anyways when you type char on bash or move the > mouse of course but it's very low frequency) Do not we get TLB flush also due to latching or are pthread_mutex_lock etc implemented without one nowadays ? > > (to be fair, because it's threaded it means they also find 512m of > address space lost more problematic than the db using processes, though > besides the reduced address space there would be no measurable slowdown > with 2.5:1.5) Hm. What 512Mb of address space loss are you speaking here. Are threaded programs only able to use 2.5G in 3G/1G memory split ? > > Also the 4:4 pretty much depends on the vgettimeofday to be backported > from the x86-64 tree and an userspace to use it, so the test may be > repeated with vgettimeofday, though it's very possible mysql isn't using > that much gettimeofday as other databases, especially the I/O bound > workload shouldn't matter that much with gettimeofday. You're right. MySQL does not use gettimeofday very frequently now, actually it uses time() most of the time, as some platforms used to have huge performance problems with gettimeofday() in the past. 
The amount of gettimeofday() use will increase dramatically in the future so it is good to know about this matter. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL) http://www.mysql.com/uc2004/ ^ permalink raw reply [flat|nested] 100+ messages in thread
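For reference on the datatype work Peter mentions: time()-based code only carries whole seconds, while gettimeofday() callers carry a (sec, usec) pair that has to flow through all the datatypes and printouts. A small illustrative helper (now_usec() is an invented name, not a MySQL function) shows the shape of the change:

```c
#include <sys/time.h>
#include <time.h>

/* time() only resolves whole seconds; gettimeofday() adds
 * microseconds. Folding the pair into one 64-bit value is a common
 * way to migrate code without threading struct timeval everywhere. */
long long now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

A query timer then becomes a subtraction of two microsecond values instead of two 1-second-granularity time_t values.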
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 16:21 ` Peter Zaitsev @ 2004-03-04 18:13 ` Andrea Arcangeli 0 siblings, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-04 18:13 UTC (permalink / raw) To: Peter Zaitsev; +Cc: Andrew Morton, riel, mbligh, linux-kernel On Thu, Mar 04, 2004 at 08:21:26AM -0800, Peter Zaitsev wrote: > On Wed, 2004-03-03 at 20:52, Andrea Arcangeli wrote: > > Andrea, > > > mysql is threaded (it's not using processes that force tlb flushes at > > every context switch), so the only time a tlb flush ever happens is when > > a syscall or an irq or a page fault happens with 4:4. Not tlb flush > > would ever happen with 3:1 in the whole workload (yeah, some background > > tlb flushing happens anyways when you type char on bash or move the > > mouse of course but it's very low frequency) > > Do not we get TLB flush also due to latching or are pthread_mutex_lock > etc implemented without one nowadays ? pthread mutex uses futex in nptl and ngpt or they use sched_yield in linuxthreads, either ways they don't need to flush the tlb. The address space is the same, no need of changing address space for the mutex (otherwise mutex would be very detrimental too). Kernel threads as well don't require a tlb flush. > > (to be fair, because it's threaded it means they also find 512m of > > address space lost more problematic than the db using processes, though > > besides the reduced address space there would be no measurable slowdown > > with 2.5:1.5) > > Hm. What 512Mb of address space loss are you speaking here. Are threaded > programs only able to use 2.5G in 3G/1G memory split ? I was talking about the 2.5:1.5: split here, 3:1 gives you 3G of address space (both for threads and processes), 2.5:1.5 would give you only 2.5G of address space to use instead (with a loss of 512m that are being used by kernel to handle properly a 64G box). 
> > Also the 4:4 pretty much depends on the vgettimeofday to be backported > > from the x86-64 tree and an userspace to use it, so the test may be > > repeated with vgettimeofday, though it's very possible mysql isn't using > > that much gettimeofday as other databases, especially the I/O bound > > workload shouldn't matter that much with gettimeofday. > > You're right. MySQL does not use gettimeofday very frequently now, > actually it uses time() most of the time, as some platforms used to have > huge performance problems with gettimeofday() in the past. > > The amount of gettimeofday() use will increase dramatically in the > future so it is good to know about this matter. If you noticed, Martin mentioned a >30% figure due to gettimeofday being called frequently (w/o vsyscalls implementing vgettimeofday like in x86-64); this figure certainly won't add linearly to your current numbers, but you can expect a significant further loss by calling gettimeofday dramatically more frequently. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 4:07 ` Andrew Morton 2004-03-04 4:44 ` Peter Zaitsev 2004-03-04 4:52 ` Andrea Arcangeli @ 2004-03-04 17:35 ` Martin J. Bligh 2004-03-04 18:16 ` Andrea Arcangeli 2004-03-04 20:21 ` Peter Zaitsev 2 siblings, 2 replies; 100+ messages in thread From: Martin J. Bligh @ 2004-03-04 17:35 UTC (permalink / raw) To: Andrew Morton, Peter Zaitsev; +Cc: andrea, riel, linux-kernel > Peter Zaitsev <peter@mysql.com> wrote: >> >> Sorry if I was unclear. These are suffexes from RH AS 3.0 kernel >> namings. "SMP" corresponds to normal SMP kernel they have, "hugemem" >> is kernel with 4G/4G split. >> >> > >> > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM, >> > > which is over 35% slowdown. >> > >> > Well no, it is a 56% speedup. Please clarify. Lots. >> >> Huh. The numbers shall be other way around of course :) "smp" kernel >> had better performance of some 7000TPM, compared to 4500TPM with >> HugeMem kernel. > > That's a larger difference than I expected. But then, everyone has been > mysteriously quiet with the 4g/4g benchmarking. > > A kernel profile would be interesting. As would an optimisation effort, > which, as far as I know, has never been undertaken. In particular: 1. a diffprofile between the two would be interesting (assuming it's at least partly increase in kernel time), or any other way to see exactly why it's slower (well, TLB flushes, obviously, but what's causing them). 2. If it's gettimeofday hammering it (which it probably is, from previous comments by others, and my own experience), then vsyscall gettimeofday (John's patch) may well fix it up. 3. Are you using the extra user address space? Otherwise yes, it'll be all downside. And 4/4 vs 3/1 isn't really a fair comparison ... 4/4 is designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. 
People have said before that DB performance can increase linearly with shared area sizes (for some workloads), so that'd bring you a 100% or so increase in performance for 4/4 to counter the loss. M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 17:35 ` Martin J. Bligh @ 2004-03-04 18:16 ` Andrea Arcangeli 2004-03-04 19:31 ` Martin J. Bligh 2004-03-04 20:21 ` Peter Zaitsev 1 sibling, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-04 18:16 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, Peter Zaitsev, riel, linux-kernel On Thu, Mar 04, 2004 at 09:35:13AM -0800, Martin J. Bligh wrote: > designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have > said before that DB performance can increase linearly with shared area > sizes (for some workloads), so that'd bring you a 100% or so increase > in performance for 4/4 to counter the loss. that's a nice theory with the benchmarks that runs with a 64G working set, but if your working set is smaller than 32G 99% of the time and you install the 64G to handle the peak load happening 1% of the time faster, you'll run 30% slower 99% of the time even if the benchmark only stressing the 64G working set runs a lot faster than with 32G only. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 18:16 ` Andrea Arcangeli @ 2004-03-04 19:31 ` Martin J. Bligh 0 siblings, 0 replies; 100+ messages in thread From: Martin J. Bligh @ 2004-03-04 19:31 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Peter Zaitsev, riel, linux-kernel > On Thu, Mar 04, 2004 at 09:35:13AM -0800, Martin J. Bligh wrote: >> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have >> said before that DB performance can increase linearly with shared area >> sizes (for some workloads), so that'd bring you a 100% or so increase >> in performance for 4/4 to counter the loss. > > that's a nice theory with the benchmarks that runs with a 64G working > set, but if your working set is smaller than 32G 99% of the time and > you install the 64G to handle the peak load happening 1% of the time > faster, you'll run 30% slower 99% of the time even if the benchmark > only stressing the 64G working set runs a lot faster than with 32G only. The amount of ram in the system, and the amount consumed by mem_map can, I think, be taken as static for the purposes of this argument. So I don't see why the total working set of the machine matters. What does matter is the per-process user address space set - if the same argument applied to that (ie most of the time, processes only use 1GB of shmem each), then I'd agree with you. I don't know whether that's true or not though ... I'll let the DB people argue that one out. Much though people hate benchmarks, it's also important to be able to prove that Linux can run as fast as RandomOtherOS in order to ensure total world domination for Linux ;-) So it would be nice to ensure the benchmarks at least have an option to be able to run as fast as possible. M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 17:35 ` Martin J. Bligh 2004-03-04 18:16 ` Andrea Arcangeli @ 2004-03-04 20:21 ` Peter Zaitsev 1 sibling, 0 replies; 100+ messages in thread From: Peter Zaitsev @ 2004-03-04 20:21 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Andrew Morton, andrea, riel, linux-kernel On Thu, 2004-03-04 at 09:35, Martin J. Bligh wrote: > > 2. If it's gettimeofday hammering it (which it probably is, from previous > comments by others, and my own experience), then vsyscall gettimeofday > (John's patch) may well fix it up. Well, as I wrote MySQL does not use a lot of gettimeofday. It rather has 2-3 calls to time() per query, but that is a very small number compared to other syscalls. > > 3. Are you using the extra user address space? Otherwise yes, it'll be > all downside. And 4/4 vs 3/1 isn't really a fair comparison ... 4/4 is > designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have > said before that DB performance can increase linearly with shared area > sizes (for some workloads), so that'd bring you a 100% or so increase > in performance for 4/4 to counter the loss. I do not really understand this :) I know 4/4 was designed for BigBoxes, however we're more interested in the side effect we get - having 4G per user process instead of 3G in the 3G/1G split. As MySQL is designed as a single process this is what is rather important for us. I was not using the extra address space in this test, as the idea was to see how much slowdown the 4G/4G split gives you with all else being the same. Based on other benchmarks I know what extra performance an extra 1Gb used as buffers can give. Bringing these numbers together I shall conclude that 4G/4G does not make sense for most MySQL loads, as 1Gb used for internal buffers (vs 1Gb used for file cache) will not give high enough performance to cover such a major speed loss. 
There are exceptions of course, for example the case where your full workload will fit in a 3G cache while it will not fit in 2G (a very edge case), or the case where you need 4G just to manage 10000+ connections with reasonable buffers etc, which is also far from the most typical scenario. For "Big Boxes" I just would not advise having a 32bit configuration at all - happily nowadays you can get 64bit pretty cheap. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL) http://www.mysql.com/uc2004/ ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-04 3:44 ` Peter Zaitsev 2004-03-04 4:07 ` Andrew Morton @ 2004-03-05 10:33 ` Ingo Molnar 2004-03-05 14:15 ` Andrea Arcangeli 1 sibling, 1 reply; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 10:33 UTC (permalink / raw) To: Peter Zaitsev; +Cc: Andrew Morton, andrea, riel, mbligh, linux-kernel * Peter Zaitsev <peter@mysql.com> wrote: > > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs > > > 1450TPM for "smp" kernel, which is some 14% slowdown. > > > > Please define these terms. What is the difference between "hugemem" and > > "smp"? > > Sorry if I was unclear. These are suffexes from RH AS 3.0 kernel > namings. "SMP" corresponds to normal SMP kernel they have, "hugemem" > is kernel with 4G/4G split. the 'hugemem' kernel also has config_highpte defined which is a bit redundant - that complexity one could avoid with the 4/4 split. Another detail: the hugemem kernel also enables PAE, which adds another 2 usecs to every syscall (!). So these performance numbers only hold if you are running mysql on x86 using more than 4GB of RAM. (which, given mysql's threaded design, doesnt make all that much of a sense.) But no doubt, the 4/4 split is not for free. If a workload does lots of high-frequency system-calls then the cost can be pretty high. vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some for mysql. Also, the highly threaded nature of mysql on the same MM which is pretty much the worst-case for the 4/4 design. If it's an issue, there are multiple ways to mitigate this cost. but 4/4 is mostly a life-extender for the high end of the x86 platform - which is dying fast. If i were to decide between some of the highly intrusive architectural highmem solutions (which all revolve about the concept of dynamically mapping back and forth) and the simplicity of 4/4, i'd go for 4/4 unless forced otherwise. 
Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
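The vsyscall-sys_gettimeofday Ingo refers to avoids the kernel entry (and hence, under 4/4, the TLB flush) by having the timer tick publish the time into a user-readable page guarded by a sequence counter. A single-process simulation of that pattern, with invented names and C11 atomics standing in for the real memory barriers:

```c
#include <stdatomic.h>

/* Simulation of the seqlock pattern behind a vsyscall gettimeofday:
 * the "kernel" side publishes the time into a shared structure, and
 * readers loop until they see a stable, even sequence number. */
struct vsys_time {
    atomic_uint seq;          /* odd while an update is in flight */
    long sec, usec;
};

/* "kernel" side, called from the timer interrupt */
static void vsys_time_update(struct vsys_time *t, long sec, long usec)
{
    atomic_fetch_add_explicit(&t->seq, 1, memory_order_release);  /* -> odd */
    t->sec = sec;
    t->usec = usec;
    atomic_fetch_add_explicit(&t->seq, 1, memory_order_release);  /* -> even */
}

/* "user" side: no syscall, no TLB flush, just reads of the page */
static void vsys_gettimeofday(struct vsys_time *t, long *sec, long *usec)
{
    unsigned int s1, s2;
    do {
        s1 = atomic_load_explicit(&t->seq, memory_order_acquire);
        *sec = t->sec;
        *usec = t->usec;
        s2 = atomic_load_explicit(&t->seq, memory_order_acquire);
    } while (s1 != s2 || (s1 & 1));  /* torn or mid-update: retry */
}
```

The reader never blocks the writer, which is why a high gettimeofday rate stops mattering once the call is served from userspace.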
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 10:33 ` Ingo Molnar @ 2004-03-05 14:15 ` Andrea Arcangeli 2004-03-05 14:32 ` Ingo Molnar 2004-03-05 14:34 ` Ingo Molnar 0 siblings, 2 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 14:15 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 11:33:08AM +0100, Ingo Molnar wrote: > > * Peter Zaitsev <peter@mysql.com> wrote: > > > > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs > > > > 1450TPM for "smp" kernel, which is some 14% slowdown. > > > > > > Please define these terms. What is the difference between "hugemem" and > > > "smp"? > > > > Sorry if I was unclear. These are suffexes from RH AS 3.0 kernel > > namings. "SMP" corresponds to normal SMP kernel they have, "hugemem" > > is kernel with 4G/4G split. > > the 'hugemem' kernel also has config_highpte defined which is a bit > redundant - that complexity one could avoid with the 4/4 split. Another the machine only has 4G of ram and you've an huge zone-normal, so I guess it will offset not more than 1 point percent or so. > detail: the hugemem kernel also enables PAE, which adds another 2 usecs > to every syscall (!). So these performance numbers only hold if you are > running mysql on x86 using more than 4GB of RAM. (which, given mysql's > threaded design, doesnt make all that much of a sense.) are you saying you force _all_ people with >4G of ram to use 4:4?!? that would be way way overkill. 8/16/32G boxes works perfectly with 3:1 with the stock 2.4 VM (after you nuke rmap). > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some > for mysql. Also, the highly threaded nature of mysql on the same MM he said he doesn't use gettimeofday frequently, so most of the flushes are from other syscalls. > which is pretty much the worst-case for the 4/4 design. If it's an definitely agreed. 
> issue, there are multiple ways to mitigate this cost. how? just curious. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:15 ` Andrea Arcangeli @ 2004-03-05 14:32 ` Ingo Molnar 2004-03-05 14:58 ` Andrea Arcangeli 2004-03-05 14:34 ` Ingo Molnar 1 sibling, 1 reply; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 14:32 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andrea Arcangeli <andrea@suse.de> wrote: > [...] 8/16/32G boxes works perfectly with 3:1 with the stock 2.4 VM > (after you nuke rmap). the mem_map[] on 32G is 400 MB (using the stock 2.4 struct page). This leaves ~500 MB for the lowmem zone. It's ridiculously easy to use up 500 MB of lowmem. 500 MB is a lowmem:RAM ratio of 1:60. With 4/4 you have 6 times more lowmem. So starting at 32 GB (but often much earlier) the 3/1 split breaks down. And you obviously it's a no-go at 64 GB. inbetween it all depends on the workload. If the 3:1 split works fine then sure, use it. There's no one kernel that fits all sizes. Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:32 ` Ingo Molnar @ 2004-03-05 14:58 ` Andrea Arcangeli 2004-03-05 15:26 ` Ingo Molnar 2004-03-05 18:42 ` Martin J. Bligh 0 siblings, 2 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 14:58 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 03:32:10PM +0100, Ingo Molnar wrote: > > * Andrea Arcangeli <andrea@suse.de> wrote: > > > [...] 8/16/32G boxes works perfectly with 3:1 with the stock 2.4 VM > > (after you nuke rmap). > > the mem_map[] on 32G is 400 MB (using the stock 2.4 struct page). This > leaves ~500 MB for the lowmem zone. It's ridiculously easy to use up 500 yes, mem_map_t takes 384M that leaves us 879-384 = 495Mbyte of zone-normal. > MB of lowmem. 500 MB is a lowmem:RAM ratio of 1:60. With 4/4 you have 6 > times more lowmem. So starting at 32 GB (but often much earlier) the 3/1 > split breaks down. And you obviously it's a no-go at 64 GB. It's a nogo for 64G but I would be really pleased to see a workload triggering the zone-normal shortage in 32G, I've never seen any one. And 16G has even more margin. Note that on a 32G box with my google-logic a correct kernel like latest 2.4 mainline reserves 100% of the zone-normal to allocations that cannot go in highmem, plus the vm highmem fixes like bh and inode zone-normal related reclaims. Without those logics it would be easy to run oom due highmem allocations going into zone-normal but that's just a vm issue and it's fixed (all fixes should be in mainline already). > inbetween it all depends on the workload. If the 3:1 split works fine > then sure, use it. There's no one kernel that fits all sizes. yes, the inbetween definitely works fine but there's always plenty of margin even on the 32G in all heavy workloads I've seen. 
I've not a single pending report for 32G boxes; all the bugreports start at >=48G, and that tells you those 32G users had 198M of margin free to use for the peak loads, which is more than enough in practice. I agree it's not a huge margin, but it's quite reasonable considering they've only 60-70% of the zone-normal pinned during the workload. ^ permalink raw reply [flat|nested] 100+ messages in thread
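The lowmem arithmetic being traded here is easy to check. Assuming the figures used in this thread (a ~48-byte stock-2.4 struct page per 4KB page of RAM, and Andrea's ~879MB of usable lowmem in a 3:1 split):

```c
/* Back-of-the-envelope check of the mem_map figures in the thread:
 * mem_map needs one struct page (~48 bytes in stock 2.4) per 4KB
 * page of RAM, carved out of ~879MB of usable 3:1 lowmem. */
static long long mem_map_mb(long long ram_gb)
{
    long long pages = ram_gb * 1024 * 1024 / 4;   /* RAM / 4KB pages */
    return pages * 48 / (1024 * 1024);            /* MB eaten by mem_map */
}

static long long lowmem_left_mb(long long ram_gb)
{
    return 879 - mem_map_mb(ram_gb);              /* normal zone remaining */
}
```

For 32G this reproduces the 384M mem_map and 495M remaining that Andrea quotes, and shows why 64G (768M of mem_map) cannot work with a 3:1 split.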
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:58 ` Andrea Arcangeli @ 2004-03-05 15:26 ` Ingo Molnar 2004-03-05 15:53 ` Andrea Arcangeli 2004-03-05 21:28 ` Martin J. Bligh 2004-03-05 18:42 ` Martin J. Bligh 1 sibling, 2 replies; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 15:26 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andrea Arcangeli <andrea@suse.de> wrote: > It's a nogo for 64G but I would be really pleased to see a workload > triggering the zone-normal shortage in 32G, I've never seen any one. > [...] have you tried TPC-C/TPC-H? Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 15:26 ` Ingo Molnar @ 2004-03-05 15:53 ` Andrea Arcangeli 2004-03-07 8:41 ` Ingo Molnar 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 15:53 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 04:26:22PM +0100, Ingo Molnar wrote: > have you tried TPC-C/TPC-H? not sure, I'm not the one dealing with the testing, but most relevant data is public on the official websites. the limit reached is around 5k users with 8cpus 32G and I don't recall that limit to be zone-normal bound. With 2.6 and bio and remap_file_pages we may reduce the zone-normal usage as well (after dropping rmap). But I definitely agree going past that with 3:1 is not feasible. Overall we may argue about the 32G (especially a 32-way would be more problematic due to the 4 times higher per-cpu memory reservation in zone-normal, I mean 48M of zone-normal are just wasted in the page allocator per-cpu logic, without counting the other per-cpu stuff, all would be easily fixable by limiting the per-cpu sizes, though for 2.4 it's probably not worth it), but I'm quite confortable to say that up to 16G (included) 4:4 is worthless unless you've to deal with the rmap waste IMHO. And <= 16G probably counts for 99% of machines out there which are handled optimally by 3:1. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 15:53 ` Andrea Arcangeli @ 2004-03-07 8:41 ` Ingo Molnar 2004-03-07 10:29 ` Nick Piggin 2004-03-07 17:24 ` Andrea Arcangeli 0 siblings, 2 replies; 100+ messages in thread From: Ingo Molnar @ 2004-03-07 8:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andrea Arcangeli <andrea@suse.de> wrote: > [...] but I'm quite confortable to say that up to 16G (included) 4:4 > is worthless unless you've to deal with the rmap waste IMHO. [...] i've seen workloads on 8G RAM systems that easily filled up the ~800 MB lowmem zone. (it had to do with many files and having them as a big dentry cache, so yes, it's unfixable unless you start putting inodes into highmem which is crazy. And yes, performance broke down unless most of the dentries/inodes were cached in lowmem.) as i said - it all depends on the workload, and users are amazingly creative at finding all sorts of workloads. Whether 4:4 or 3:1 is thus workload dependent. should lowmem footprint be reduced? By all means yes, but only as long as it doesnt jeopardize the real 64-bit platforms. Is 3:1 adequate as a generic x86 kernel for absolutely everything up to and including 16 GB? Strong no. [not to mention that 'up to 16 GB' is an artificial thing created by us which wont satisfy an IHV that has a hw line with RAM up to 32 or 64 GB. It doesnt matter that 90% of the customers wont have that much RAM, it's a basic "can it scale to that much RAM" question.] so i think the right answer is to have 4:4 around to cover the bases - and those users who have workloads that will run fine on 3:1 should run 3:1. (not to mention the range of users who need 4GB _userspace_.) but i'm quite strongly convinced that 'getting rid' of the 'pte chain overhead' in favor of questionable lowmem space gains for a dying (high-end server) platform is very shortsighted. 
[getting rid of them for purposes of the 64-bit platforms could be OK, but the argumentation isnt that strong there i think.] Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-07 8:41 ` Ingo Molnar @ 2004-03-07 10:29 ` Nick Piggin 2004-03-07 17:33 ` Andrea Arcangeli 2004-03-07 17:24 ` Andrea Arcangeli 1 sibling, 1 reply; 100+ messages in thread From: Nick Piggin @ 2004-03-07 10:29 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel Ingo Molnar wrote: >* Andrea Arcangeli <andrea@suse.de> wrote: > > >>[...] but I'm quite confortable to say that up to 16G (included) 4:4 >>is worthless unless you've to deal with the rmap waste IMHO. [...] >> > >i've seen workloads on 8G RAM systems that easily filled up the ~800 MB >lowmem zone. (it had to do with many files and having them as a big >dentry cache, so yes, it's unfixable unless you start putting inodes >into highmem which is crazy. And yes, performance broke down unless most >of the dentries/inodes were cached in lowmem.) > > If you still have any of these workloads around, they would be good to test on the memory management changes in Andrew's mm tree which should correctly balance slab on highmem systems. Linus' tree has a few problems here. But if you really have a lot more than 800MB of active dentries, then maybe 4:4 would be a win? ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-07 10:29 ` Nick Piggin @ 2004-03-07 17:33 ` Andrea Arcangeli 2004-03-08 5:15 ` Nick Piggin 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-07 17:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Sun, Mar 07, 2004 at 09:29:37PM +1100, Nick Piggin wrote: > > > Ingo Molnar wrote: > > >* Andrea Arcangeli <andrea@suse.de> wrote: > > > > > >>[...] but I'm quite confortable to say that up to 16G (included) 4:4 > >>is worthless unless you've to deal with the rmap waste IMHO. [...] > >> > > > >i've seen workloads on 8G RAM systems that easily filled up the ~800 MB > >lowmem zone. (it had to do with many files and having them as a big > >dentry cache, so yes, it's unfixable unless you start putting inodes > >into highmem which is crazy. And yes, performance broke down unless most > >of the dentries/inodes were cached in lowmem.) > > > > > > If you still have any of these workloads around, they would be I also have workloads that would die with 4:4 and rmap. the question is if they tested this in the stock 2.4 or 2.4-aa VM, or if this was tested on kernels with rmap. most kernels are also broken w.r.t. lowmem reservation, there are huge vm design breakages in tons of 2.4 out there, those breakages would generate lowmem shortages too, so just saying the 8G box runs out of lowmem is meaningless unless we know exactly which kind of 2.4 incarnation was running on that box. For instance google was running out of lowmem zone even on 2.5G boxes until I fixed it, and the fix was merged in mainline only around 2.4.23, so unless I'm sure all relevant fixes were applied, "the 8G runs out of lowmem" means nothing to me, since it was running out of lowmem for me too for ages even on the 4G boxes until I've fixed all those issues in the vm, not related to the pinned amount of memory. 
alternatively if they can count the number of tasks, and the number of files open, we can do the math and count the mbytes of lowmem pinned, that as well can demonstrate it was a limitation of the 3:1 and not a design bug of the vm in-use on that box. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-07 17:33 ` Andrea Arcangeli @ 2004-03-08 5:15 ` Nick Piggin 0 siblings, 0 replies; 100+ messages in thread From: Nick Piggin @ 2004-03-08 5:15 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel Andrea Arcangeli wrote: >On Sun, Mar 07, 2004 at 09:29:37PM +1100, Nick Piggin wrote: > >> >>Ingo Molnar wrote: >> >> >>>* Andrea Arcangeli <andrea@suse.de> wrote: >>> >>> >>> >>>>[...] but I'm quite confortable to say that up to 16G (included) 4:4 >>>>is worthless unless you've to deal with the rmap waste IMHO. [...] >>>> >>>> >>>i've seen workloads on 8G RAM systems that easily filled up the ~800 MB >>>lowmem zone. (it had to do with many files and having them as a big >>>dentry cache, so yes, it's unfixable unless you start putting inodes >>>into highmem which is crazy. And yes, performance broke down unless most >>>of the dentries/inodes were cached in lowmem.) >>> >>> >>> >>If you still have any of these workloads around, they would be >> > >I also have workloads that would die with 4:4 and rmap. > > I don't doubt that, and of course no amount of tinkering with reclaim will help where you are dying due to pinned lowmem. Ingo's workload sounded like slab cache reclaim improvements in recent mm kernels might possibly help. I was purely interested in this for testing the reclaim changes. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-07 8:41 ` Ingo Molnar 2004-03-07 10:29 ` Nick Piggin @ 2004-03-07 17:24 ` Andrea Arcangeli 1 sibling, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-07 17:24 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Sun, Mar 07, 2004 at 09:41:20AM +0100, Ingo Molnar wrote: > > * Andrea Arcangeli <andrea@suse.de> wrote: > > > [...] but I'm quite confortable to say that up to 16G (included) 4:4 > > is worthless unless you've to deal with the rmap waste IMHO. [...] > > i've seen workloads on 8G RAM systems that easily filled up the ~800 MB > lowmem zone. (it had to do with many files and having them as a big Was that a kernel with rmap or w/o rmap? > but i'm quite strongly convinced that 'getting rid' of the 'pte chain > overhead' in favor of questionable lowmem space gains for a dying > (high-end server) platform is very shortsighted. [getting rid of them > for purposes of the 64-bit platforms could be OK, but the argumentation > isnt that strong there i think.] Disagree: the reason I'm doing it is for the 64bit platforms; I couldn't care less about x86. The VM is dog-slow with rmap. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 15:26 ` Ingo Molnar 2004-03-05 15:53 ` Andrea Arcangeli @ 2004-03-05 21:28 ` Martin J. Bligh 1 sibling, 0 replies; 100+ messages in thread From: Martin J. Bligh @ 2004-03-05 21:28 UTC (permalink / raw) To: Ingo Molnar, Andrea Arcangeli Cc: Peter Zaitsev, Andrew Morton, riel, linux-kernel > * Andrea Arcangeli <andrea@suse.de> wrote: > >> It's a nogo for 64G but I would be really pleased to see a workload >> triggering the zone-normal shortage in 32G, I've never seen any one. >> [...] > > have you tried TPC-C/TPC-H? We're doing those here. Publishing results will be tricky due to their draconian rules, but I'm sure you'll be able to read between the lines ;-) OASB (Oracle apps) is the other total killer I've found in the past. M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:58 ` Andrea Arcangeli 2004-03-05 15:26 ` Ingo Molnar @ 2004-03-05 18:42 ` Martin J. Bligh 2004-03-05 19:13 ` Andrea Arcangeli 1 sibling, 1 reply; 100+ messages in thread From: Martin J. Bligh @ 2004-03-05 18:42 UTC (permalink / raw) To: Andrea Arcangeli, Ingo Molnar Cc: Peter Zaitsev, Andrew Morton, riel, linux-kernel > It's a nogo for 64G but I would be really pleased to see a workload > triggering the zone-normal shortage in 32G, I've never seen any one. And > 16G has even more margin. The things I've seen consume ZONE_NORMAL (and which aren't reclaimable) are:
1. mem_map (obviously) (64GB = 704MB of mem_map)
2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)
3. Pagetables (pte_highmem helps; pmds still exist but are less of a problem - 10,000 tasks would be 117MB)
4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help, obviously)
5. rmap chains - this is the real killer without objrmap (even 1000 tasks sharing a 2GB shmem segment will kill you without large pages).
6. vmas - weird Oracle things, especially before remap_file_pages.
I may have forgotten some, but I think those were the main ones. 10,000 tasks is a little heavy, but it's easy to scale the numbers around. I guess my main point is that it's often as much to do with the number of tasks as it is with just the larger amount of memory - but bigger machines tend to run more tasks, so it often goes hand-in-hand. Also bear in mind that as memory gets tight, reclaimable things like the dcache and icache will get shrunk, which will hurt performance itself too, so some of the cost of 4/4 is paid back there as well. Without shared pagetables, we may need highpte even on 4/4, which kind of sucks (can be a 10% or so hit). M. ^ permalink raw reply [flat|nested] 100+ messages in thread
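The arithmetic behind Martin's figures can be sanity-checked. A rough sketch in C (the per-task page counts are from the list above; the 44-byte sizeof(struct page) and 8K kernel stacks are assumptions about x86 kernels of that era):

```c
#include <assert.h>

#define PAGE_SIZE        4096UL
#define STRUCT_PAGE_SIZE 44UL   /* assumed sizeof(struct page), 2.6-era x86 */

/* 1. mem_map: one struct page per physical page frame. */
static unsigned long mem_map_bytes(unsigned long ram_bytes)
{
        return (ram_bytes / PAGE_SIZE) * STRUCT_PAGE_SIZE;
}

/* 3. pmds: three pmd pages per task with PAE. */
static unsigned long pmd_bytes(unsigned long tasks)
{
        return tasks * 3 * PAGE_SIZE;
}

/* 4. kernel stacks: two pages (8K) per task. */
static unsigned long stack_bytes(unsigned long tasks)
{
        return tasks * 2 * PAGE_SIZE;
}
```

With these assumptions, 64GB of RAM gives 704MB of mem_map, and 10,000 tasks pin 117MB of pmds and 78MB of kernel stacks, matching the figures in the list.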
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 18:42 ` Martin J. Bligh @ 2004-03-05 19:13 ` Andrea Arcangeli 2004-03-05 19:55 ` Martin J. Bligh 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 19:13 UTC (permalink / raw) To: Martin J. Bligh Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, linux-kernel On Fri, Mar 05, 2004 at 10:42:55AM -0800, Martin J. Bligh wrote: > > It's a nogo for 64G but I would be really pleased to see a workload > > triggering the zone-normal shortage in 32G, I've never seen any one. And > > 16G has even more margin. > > The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are: > > 1. mem_map (obviously) (64GB = 704MB of mem_map) I was asking about 32G; that's half of that and it leaves 500M free. 64G is a no-go with 3:1. > > 2. Buffer_heads (much improved in 2.6, though not completely gone IIRC) The VM is able to reclaim them before going OOM, though it has a performance cost. > 3. Pagetables (pte_highmem helps, pmds are existant, but less of a problem, > 10,000 tasks would be 117MB) pmds seem to be 13M for 10000 tasks, but maybe I did the math wrong. > > 4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously) 4k stacks then need to allocate the task struct in the heap; it still saves RAM, but it's not very different. > > 5. rmap chains - this is the real killer without objrmap (even 1000 tasks > sharing a 2GB shmem segment will kill you without large pages). This overhead doesn't exist in 2.4. > 6. vmas - wierdo Oracle things before remap_file_pages especially. This is one of the main issues of 2.4. > I may have forgotten some, but I think those were the main ones. 10,000 tasks > is a little heavy, but it's easy to scale the numbers around. 
> I guess my main point is that it's often as much to do with the number of tasks as it is > with just the larger amount of memory - but bigger machines tend to run more > tasks, so it often goes hand-in-hand. Yes, it's unlikely an 8-way with 32G can scale up to 10000 tasks regardless, but maybe things change with a 32-way 32G. The main thing you didn't mention is the overhead of the per-cpu data structures; that alone generates an overhead of several dozen mbytes just in the page allocator, without accounting for the slab caches, pagetable caches etc. Putting a high limit on the per-cpu caches should make a 32-way 32G work fine with 3:1 too, though. 8-way is fine with 32G currently. Other relevant things are the fs stuff, like file handles per task, and other pinned slab things. > Also bear in mind that as memory gets tight, the reclaimable things like > dcache and icache will get shrunk, which will hurt performance itself too, For these workloads (the 10000-task workloads are the ones we know very well) dcache/icache doesn't matter, and I still find 3:1 a more generic kernel than 4:4 for random workloads too. And if you don't run the 10000-task workload, then you have the normal zone free to use for the dcache anyway. > so some of the cost of 4/4 is paid back there too. Without shared pagetables, > we may need highpte even on 4/4, which kind of sucks (can be 10% or so hit). I think pte-highmem is definitely needed on 4:4 too; even if you use hugetlbfs, that won't cover PAE and the granular window, which is quite a lot of the RAM. Overall, shared pagetables don't pay off for their complexity; rather than sharing the pagetables, it's better not to allocate them in the first place ;) (hugetlbfs/largepages). The practical limit of the hardware was 5k tasks, not a kernel issue. 
Your 10k example has never been tested, but obviously at some point a limit will trigger (eventually get_pid will stop finding a free pid too ;) ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 19:13 ` Andrea Arcangeli @ 2004-03-05 19:55 ` Martin J. Bligh 2004-03-05 20:29 ` Andrea Arcangeli 0 siblings, 1 reply; 100+ messages in thread From: Martin J. Bligh @ 2004-03-05 19:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, linux-kernel >> The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are: >> >> 1. mem_map (obviously) (64GB = 704MB of mem_map) > > I was asking 32G, that's half of that and it leaves 500M free. 64G is a > no-way with 3:1. Yup. >> 2. Buffer_heads (much improved in 2.6, though not completely gone IIRC) > > the vm is able to reclaim them before running oom, though it has a > performance cost. Didn't used to in SLES8 at least. maybe it does in 2.6 now, I know Andrew worked on that a lot. >> 3. Pagetables (pte_highmem helps, pmds are existant, but less of a problem, >> 10,000 tasks would be 117MB) > > pmds seems 13M for 10000 tasks, but maybe I did the math wrong. 3 pages per task = 12K per task = 120,000Kb. Or that's the way I figured it at least. >> 4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously) > > 4k stacks then need to allocate the task struct in the heap, though it > still saves ram, but it's not very different. In 2.6, I think the task struct is outside the kernel stack either way. Maybe you were pointing out something else? not sure. > The main thing you didn't mention is the overhead in the per-cpu data > structures, that alone generates an overhead of several dozen mbytes > only in the page allocator, without accounting the slab caches, > pagetable caches etc.. putting an high limit to the per-cpu caches > should make a 32-way 32G work fine with 3:1 too though. 8-way is > fine with 32G currently. Humpf. Do you have a hard figure on how much it actually is per cpu? 
> other relevant things are the fs stuff like file handles per task and > other pinned slab things. Yeah, that was a huge one we forgot ... sysfs. Particularly with large numbers of disks, IIRC, though other resources might generate similar issues. > I think pte-highmem is definitely needed on 4:4 too, even if you use > hugetlbfs that won't cover PAE and the granular window which is quite a > lot of the ram. > > Overall shared pageteables doesn't payoff for its complexity, rather > than sharing the pagetables it's better not to allocate them in the > first place ;) (hugetlbfs/largepages). That might be another approach, yes ... some more implicit allocation stuff would help here - modifying ISV apps is a PITA to get done, and takes *forever*. Adam wrote some patches that are sitting in my tree, some of which were ported forward from SLES8. But then we get into massive problems with them not being swappable, so you need capabilities, etc, etc. Ugh. > The pratical limit of the hardware was 5k tasks, not a kernel issue. > Your 10k example has never been tested, but obviously at some point a > limit will trigger (eventually the get_pid will stop finding a free pid > too ;) You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger boxes will get progressively scarier ;-) What scares me more is that we can sit playing counting games all day, but there's always something we will forget. So I'm not keen on playing brinkmanship games with customers systems ;-) M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 19:55 ` Martin J. Bligh @ 2004-03-05 20:29 ` Andrea Arcangeli 2004-03-05 20:41 ` Andrew Morton 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 20:29 UTC (permalink / raw) To: Martin J. Bligh Cc: Ingo Molnar, Peter Zaitsev, Andrew Morton, riel, linux-kernel On Fri, Mar 05, 2004 at 11:55:05AM -0800, Martin J. Bligh wrote: > Didn't used to in SLES8 at least. maybe it does in 2.6 now, I know Andrew > worked on that a lot. It should in every SLES8 kernel out there too (it wasn't in mainline until very recently); see the related bhs stuff. > In 2.6, I think the task struct is outside the kernel stack either way. > Maybe you were pointing out something else? not sure. I meant that making the kernel stack 4k pretty much requires removing the task_struct; making it 4k w/o removing the task_struct sounds too small. > > The main thing you didn't mention is the overhead in the per-cpu data > > structures, that alone generates an overhead of several dozen mbytes > > only in the page allocator, without accounting the slab caches, > > pagetable caches etc.. putting an high limit to the per-cpu caches > > should make a 32-way 32G work fine with 3:1 too though. 8-way is > > fine with 32G currently. > > Humpf. Do you have a hard figure on how much it actually is per cpu? Not a definitive one, but it's surely more than 2m per cpu, could be 3m per cpu. > > other relevant things are the fs stuff like file handles per task and > > other pinned slab things. > > Yeah, that was a huge one we forgot ... sysfs. Particularly with large > numbers of disks, IIRC, though other resources might generate similar > issues. Sysfs doesn't need to be mounted during production: hotplug should mount it, read it, and unmount it. It's worthless to leave it mounted. 
Only root-only, hardware-related stuff should be in sysfs; everything else that has been abstracted at the kernel level (transparently to applications) should remain in /proc. Unmounting /proc hurts production systems; unmounting sysfs should not. > You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger > boxes will get progressively scarier ;-) Yes. > What scares me more is that we can sit playing counting games all day, > but there's always something we will forget. So I'm not keen on playing > brinkmanship games with customers systems ;-) This is true for 4:4 too. Also, with 2.4 the system will return -ENOMEM, unlike 2.6, which locks up the box. So it's not fatal if a certain kernel can't sustain a certain workload on certain hardware, just as it's not fatal if you run out of memory for the pagetables on a 64bit architecture with a 64bit kernel. My only objective is to make it feasible to run the most high-end workloads on the most high-end hardware with a good safety margin, knowing that if something goes wrong, the worst that can happen is a syscall returning -ENOMEM. There will always be a malicious workload able to fill zone-normal: if you fork off tons of tasks, open a gazillion sockets, and flood all of them at the same time to fill all the receive windows, you'll fill your cool 4G zone-normal of 4:4 in half a second with a 10gigabit NIC. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 20:29 ` Andrea Arcangeli @ 2004-03-05 20:41 ` Andrew Morton 2004-03-05 21:07 ` Andrea Arcangeli 0 siblings, 1 reply; 100+ messages in thread From: Andrew Morton @ 2004-03-05 20:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: mbligh, mingo, peter, riel, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > > > The main thing you didn't mention is the overhead in the per-cpu data > > > structures, that alone generates an overhead of several dozen mbytes > > > only in the page allocator, without accounting the slab caches, > > > pagetable caches etc.. putting an high limit to the per-cpu caches > > > should make a 32-way 32G work fine with 3:1 too though. 8-way is > > > fine with 32G currently. > > > > Humpf. Do you have a hard figure on how much it actually is per cpu? > > not a definitive one, but it's sure more than 2m per cpu, could be 3m > per cpu. It'll average out to 68 pages per cpu. (4 in ZONE_DMA, 64 in ZONE_NORMAL). That's eight megs on 32-way. Maybe it can be trimmed back a bit, but on 32-way you probably want the locking amortisation more than the 8 megs. The settings we have in there are still pretty much guesswork. I don't think anyone has done any serious tuning on them. Any differences are likely to be small. ^ permalink raw reply [flat|nested] 100+ messages in thread
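Andrew's estimate works out as follows (a rough sketch, assuming 4K pages):

```c
#include <assert.h>

/* Memory held in the per-cpu page allocator caches: pages held per
 * cpu (4 in ZONE_DMA + 64 in ZONE_NORMAL = 68) times the page size,
 * summed over all cpus. */
static unsigned long pcp_cache_bytes(unsigned long cpus,
                                     unsigned long pages_per_cpu)
{
        return cpus * pages_per_cpu * 4096UL;
}
```

32 cpus x 68 pages x 4K comes to roughly 8.5MB, i.e. the "eight megs on 32-way" figure above.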
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 20:41 ` Andrew Morton @ 2004-03-05 21:07 ` Andrea Arcangeli 2004-03-05 22:12 ` Andrew Morton 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 21:07 UTC (permalink / raw) To: Andrew Morton; +Cc: mbligh, mingo, peter, riel, linux-kernel On Fri, Mar 05, 2004 at 12:41:19PM -0800, Andrew Morton wrote: > Andrea Arcangeli <andrea@suse.de> wrote: > > > > > > The main thing you didn't mention is the overhead in the per-cpu data > > > > structures, that alone generates an overhead of several dozen mbytes > > > > only in the page allocator, without accounting the slab caches, > > > > pagetable caches etc.. putting an high limit to the per-cpu caches > > > > should make a 32-way 32G work fine with 3:1 too though. 8-way is > > > > fine with 32G currently. > > > > > > Humpf. Do you have a hard figure on how much it actually is per cpu? > > > > not a definitive one, but it's sure more than 2m per cpu, could be 3m > > per cpu. > > It'll average out to 68 pages per cpu. (4 in ZONE_DMA, 64 in ZONE_NORMAL). 3m per cpu with all 3m in zone normal. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 21:07 ` Andrea Arcangeli @ 2004-03-05 22:12 ` Andrew Morton 0 siblings, 0 replies; 100+ messages in thread From: Andrew Morton @ 2004-03-05 22:12 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: mbligh, mingo, peter, riel, linux-kernel Andrea Arcangeli <andrea@suse.de> wrote: > > On Fri, Mar 05, 2004 at 12:41:19PM -0800, Andrew Morton wrote: > > Andrea Arcangeli <andrea@suse.de> wrote: > > > > > > > > The main thing you didn't mention is the overhead in the per-cpu data > > > > > structures, that alone generates an overhead of several dozen mbytes > > > > > only in the page allocator, without accounting the slab caches, > > > > > pagetable caches etc.. putting an high limit to the per-cpu caches > > > > > should make a 32-way 32G work fine with 3:1 too though. 8-way is > > > > > fine with 32G currently. > > > > > > > > Humpf. Do you have a hard figure on how much it actually is per cpu? > > > > > > not a definitive one, but it's sure more than 2m per cpu, could be 3m > > > per cpu. > > > > It'll average out to 68 pages per cpu. (4 in ZONE_DMA, 64 in ZONE_NORMAL). > > 3m per cpu with all 3m in zone normal. In the page allocator? How did you arrive at this figure? ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:15 ` Andrea Arcangeli 2004-03-05 14:32 ` Ingo Molnar @ 2004-03-05 14:34 ` Ingo Molnar 2004-03-05 14:59 ` Andrea Arcangeli 1 sibling, 1 reply; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 14:34 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andrea Arcangeli <andrea@suse.de> wrote: > > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some ^^^^^^^^^^^^^^^^^ > > for mysql. Also, the highly threaded nature of mysql on the same MM > > he said he doesn't use gettimeofday frequently, so most of the flushes > are from other syscalls. you are not reading Pete's and my emails too carefully, are you? Pete said: > [...] MySQL does not use gettimeofday very frequently now, actually it > uses time() most of the time, as some platforms used to have huge ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > performance problems with gettimeofday() in the past. > > The amount of gettimeofday() use will increase dramatically in the > future so it is good to know about this matter. Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:34 ` Ingo Molnar @ 2004-03-05 14:59 ` Andrea Arcangeli 2004-03-05 15:02 ` Ingo Molnar 0 siblings, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 14:59 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 03:34:25PM +0100, Ingo Molnar wrote: > > * Andrea Arcangeli <andrea@suse.de> wrote: > > > > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some > ^^^^^^^^^^^^^^^^^ > > > for mysql. Also, the highly threaded nature of mysql on the same MM > > > > he said he doesn't use gettimeofday frequently, so most of the flushes > > are from other syscalls. > > you are not reading Pete's and my emails too carefully, are you? Pete > said: I thought time() wouldn't be called more than once per second anyway; why would anyone call time() more than once per second? > > > [...] MySQL does not use gettimeofday very frequently now, actually it > > uses time() most of the time, as some platforms used to have huge > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > performance problems with gettimeofday() in the past. > > > > The amount of gettimeofday() use will increase dramatically in the > > future so it is good to know about this matter. > > Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 14:59 ` Andrea Arcangeli @ 2004-03-05 15:02 ` Ingo Molnar [not found] ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel> ` (2 more replies) 0 siblings, 3 replies; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 15:02 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andrea Arcangeli <andrea@suse.de> wrote: > I thought time() wouldn't be called more than 1 per second anyways, > why would anyone call time more than 1 per second? if mysql in fact calls time() frequently, then it should rather start a worker thread that updates a global time variable every second. Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
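A minimal sketch of the scheme Ingo describes, assuming POSIX threads (the names are illustrative, not MySQL's actual code):

```c
#include <pthread.h>
#include <time.h>
#include <unistd.h>
#include <assert.h>

/* Cached wall-clock second, updated once per second by a worker
 * thread; hot paths read the variable instead of entering the kernel
 * on every time() call. */
static volatile time_t cached_time;

static void *time_updater(void *arg)
{
        (void)arg;
        for (;;) {
                cached_time = time(NULL);
                sleep(1);       /* 1s resolution, same as time() itself */
        }
        return NULL;
}

static void start_time_thread(void)
{
        pthread_t tid;
        cached_time = time(NULL);   /* valid even before the thread runs */
        pthread_create(&tid, NULL, time_updater, NULL);
        pthread_detach(tid);
}

/* Drop-in replacement for time(NULL) on hot paths. */
static time_t fast_time(void)
{
        return cached_time;
}
```

As discussed further down the thread, the cached value can lag the real clock by up to a second (more under scheduling delay or paging), so callers must tolerate a slightly stale result.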
[parent not found: <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>]
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) [not found] ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel> @ 2004-03-05 15:51 ` Andi Kleen 2004-03-05 16:23 ` Ingo Molnar 0 siblings, 1 reply; 100+ messages in thread From: Andi Kleen @ 2004-03-05 15:51 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel Ingo Molnar <mingo@elte.hu> writes: > * Andrea Arcangeli <andrea@suse.de> wrote: > > > I thought time() wouldn't be called more than 1 per second anyways, > > why would anyone call time more than 1 per second? > > if mysql in fact calls time() frequently, then it should rather start a > worker thread that updates a global time variable every second. I just fixed the x86-64 vsyscall vtime() to only read the user mapped __xtime.tv_sec. This should be equivalent. Only drawback is that if a timer tick is delayed for too long it won't fix that, but I guess that's reasonable for a 1s resolution. -Andi ^ permalink raw reply [flat|nested] 100+ messages in thread
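What Andi describes amounts to the kernel exporting xtime to a user-visible page, so that vtime() is a single memory load. A user-space sketch of the idea (the struct and names here are stand-ins, not the real x86-64 vsyscall page layout):

```c
#include <assert.h>
#include <time.h>

/* Stand-in for the kernel-maintained, user-mapped xtime: the timer
 * interrupt stores the current second here on every tick. */
struct fake_xtime {
        volatile time_t tv_sec;
        long tv_nsec;
};
static struct fake_xtime fake_xtime_page;

/* vtime(): no kernel entry, just one aligned word load - atomic on
 * x86-64, and good enough for 1s resolution. */
static time_t vtime(void)
{
        return fake_xtime_page.tv_sec;
}
```

The drawback Andi notes follows directly from this shape: the reader only ever sees the value stored at the last tick, so a delayed tick means a stale second.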
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 15:51 ` Andi Kleen @ 2004-03-05 16:23 ` Ingo Molnar 2004-03-05 16:39 ` Andrea Arcangeli 2004-03-10 13:21 ` Andi Kleen 0 siblings, 2 replies; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 16:23 UTC (permalink / raw) To: Andi Kleen Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andi Kleen <ak@suse.de> wrote: > > if mysql in fact calls time() frequently, then it should rather start a > > worker thread that updates a global time variable every second. > > I just fixed the x86-64 vsyscall vtime() to only read the user mapped > __xtime.tv_sec. This should be equivalent. [...] yeah - nice! > [...] Only drawback is that if a timer tick is delayed for too long it > won't fix that, but I guess that's reasonable for a 1s resolution. what do you mean by delayed? Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 16:23 ` Ingo Molnar @ 2004-03-05 16:39 ` Andrea Arcangeli 2004-03-07 8:16 ` Ingo Molnar 2004-03-10 13:21 ` Andi Kleen 1 sibling, 1 reply; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 16:39 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 05:23:19PM +0100, Ingo Molnar wrote: > what do you mean by delayed? If the timer softirq doesn't run and wall_jiffies doesn't increase, we won't be able to account for it, so time() will return a time in the past; it can potentially go backwards by precisely 1/HZ seconds for every tick that doesn't execute the timer softirq. I tend to agree that for a 1sec resolution that's not a big deal, though if you run:
gettimeofday()
time()
gettimeofday may say the time of day is 17:39:10 while time may report 17:39:09 ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 16:39 ` Andrea Arcangeli @ 2004-03-07 8:16 ` Ingo Molnar 0 siblings, 0 replies; 100+ messages in thread From: Ingo Molnar @ 2004-03-07 8:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andi Kleen, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Andrea Arcangeli <andrea@suse.de> wrote: > > what do you mean by delayed? > > if the timer softirq doesn't run and wall_jiffies doesn't increase, we > won't be able to account for it, so time() will return a time in the > past, it will potentially go backwards precisely 1/HZ seconds every > tick that isn't executing the timer softirq. [...] we agree that this all is not an issue, but the reasons are different from what you describe. wall_jiffies (and, more importantly, xtime.tv_sec - which is the clock source used by sys_time()) is updated from hardirq context - so softirq delay cannot impact it. gettimeofday() and time() are unsynchronized clocks, and time() will almost always return a time less than the current time - due to rounding down. in the moments where there's a timer IRQ pending (or the timer IRQ's time update effect is delayed eg. due to contention on xtime_lock) gettimeofday() can estimate the current time past the timer tick, at which moment the inaccuracy of time() can be briefly higher than 1 second. (in most cases it should be 1 second + delta) > [...] I tend to agree for a 1sec resultion that's not a big deal > though if you run: > > gettimeofday() > time() > > gettimeofday may say the time of the day is 17:39:10 and time may tell > 17:39:09 nobody should rely on gettimeofday() and time() being synchronized on the second level. Typically the delta will be [0 ... 0.999999 ] seconds, occasionally it can get larger. and this has nothing to do with using vsyscalls and it can already happen. xtime.tv_sec is used without any synchronization so even if xtime were synchronized with gettimeofday() [eg. 
by do_gettimeofday() noticing that xtime.tv_sec needs an update] - the access is not serialized on SMP. Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 16:23 ` Ingo Molnar 2004-03-05 16:39 ` Andrea Arcangeli @ 2004-03-10 13:21 ` Andi Kleen 2004-03-05 16:42 ` Andrea Arcangeli 2004-03-05 16:49 ` Ingo Molnar 1 sibling, 2 replies; 100+ messages in thread From: Andi Kleen @ 2004-03-10 13:21 UTC (permalink / raw) To: Ingo Molnar; +Cc: andrea, peter, akpm, riel, mbligh, linux-kernel On Fri, 5 Mar 2004 17:23:19 +0100 Ingo Molnar <mingo@elte.hu> wrote: > > [...] Only drawback is that if a timer tick is delayed for too long it > > won't fix that, but I guess that's reasonable for a 1s resolution. > > what do you mean by delayed? Normal gettimeofday can "fix" lost timer ticks because it computes the true offset to the last timer interrupt using the TSC or other means. xtime is always the last tick without any correction. If it got delayed too much the result will be out of date. -Andi ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-10 13:21 ` Andi Kleen @ 2004-03-05 16:42 ` Andrea Arcangeli 2004-03-05 16:49 ` Ingo Molnar 0 siblings, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 16:42 UTC (permalink / raw) To: Andi Kleen; +Cc: Ingo Molnar, peter, akpm, riel, mbligh, linux-kernel On Wed, Mar 10, 2004 at 02:21:25PM +0100, Andi Kleen wrote: > On Fri, 5 Mar 2004 17:23:19 +0100 > Ingo Molnar <mingo@elte.hu> wrote: > > > > [...] Only drawback is that if a timer tick is delayed for too long it > > > won't fix that, but I guess that's reasonable for a 1s resolution. > > > > what do you mean by delayed? > > Normal gettimeofday can "fix" lost timer ticks because it computes the true > offset to the last timer interrupt using the TSC or other means. xtime > is always the last tick without any correction. If it got delayed too much > the result will be out of date. Lost timer ticks don't worry me that much; they mess up the system time persistently anyway with 2.4 (and not all platforms use the TSC anyway, even on x86). It's only the lost softirqs that concern me. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-10 13:21 ` Andi Kleen 2004-03-05 16:42 ` Andrea Arcangeli @ 2004-03-05 16:49 ` Ingo Molnar 2004-03-05 16:58 ` Andrea Arcangeli 1 sibling, 1 reply; 100+ messages in thread From: Ingo Molnar @ 2004-03-05 16:49 UTC (permalink / raw) To: Andi Kleen; +Cc: andrea, peter, akpm, riel, mbligh, linux-kernel * Andi Kleen <ak@suse.de> wrote: > > > [...] Only drawback is that if a timer tick is delayed for too long it > > > won't fix that, but I guess that's reasonable for a 1s resolution. > > > > what do you mean by delayed? > > Normal gettimeofday can "fix" lost timer ticks because it computes the > true offset to the last timer interrupt using the TSC or other means. > xtime is always the last tick without any correction. If it got > delayed too much the result will be out of date. yeah - i doubt the softirq delay is a real issue. Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 16:49 ` Ingo Molnar @ 2004-03-05 16:58 ` Andrea Arcangeli 0 siblings, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-03-05 16:58 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andi Kleen, peter, akpm, riel, mbligh, linux-kernel On Fri, Mar 05, 2004 at 05:49:02PM +0100, Ingo Molnar wrote: > > * Andi Kleen <ak@suse.de> wrote: > > > > > [...] Only drawback is that if a timer tick is delayed for too long it > > > > won't fix that, but I guess that's reasonable for a 1s resolution. > > > > > > what do you mean by delayed? > > > > Normal gettimeofday can "fix" lost timer ticks because it computes the > > true offset to the last timer interrupt using the TSC or other means. > > xtime is always the last tick without any correction. If it got > > delayed too much the result will be out of date. > > yeah - i doubt the softirq delay is a real issue. Do you think it's more likely the irq is lost? I think it's more likely that the softirq takes more than 1msec than that the irq is lost. If the softirq takes more than 1msec, we don't necessarily need to fix that: the timer code is designed to handle that case properly, and the softirq is the place to do the bulk of the work. If an irq is lost, we definitely need to fix that. Either way, time may go backwards w.r.t. gettimeofday. I'm not saying it's a real issue though. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 15:02 ` Ingo Molnar [not found] ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel> @ 2004-03-05 20:11 ` Jamie Lokier 2004-03-06 5:12 ` Jamie Lokier 2004-03-07 11:55 ` Ingo Molnar 2004-03-07 6:50 ` Peter Zaitsev 2 siblings, 2 replies; 100+ messages in thread From: Jamie Lokier @ 2004-03-05 20:11 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel Ingo Molnar wrote: > if mysql in fact calls time() frequently, then it should rather start a > worker thread that updates a global time variable every second. That has the same problem as discussed later in this thread with vsyscall-time: the worker thread may not run immediately when it is woken, and also setitimer() and select() round up the delay a little more than expected, so sometimes the global time variable will be out of date and misordered w.r.t. gettimeofday() and stat() results of recently modified files. Also, if there's paging the variable may be out of date by quite a long time, so mlock() should be used to remove that aspect of the delay. I don't know if such delays are a problem for MySQL. -- Jamie ^ permalink raw reply [flat|nested] 100+ messages in thread
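Ingo's suggested pattern — a worker thread refreshing a global time variable once a second — might look like the sketch below. The names (`cached_time`, `time_keeper`) are hypothetical, not MySQL's code; the mlockall() call addresses the paging delay Jamie mentions, and the sleep(1) is exactly the rounding-up source he critiques:

```c
#include <pthread.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Global second counter refreshed by a worker thread.  Readers pay no
 * syscall cost, at the price of up to ~1s of staleness (more if the
 * thread is delayed, as discussed above). */
static volatile time_t cached_time;

static void *time_keeper(void *arg)
{
    (void)arg;
    for (;;) {
        cached_time = time(NULL);  /* may lag if this thread is delayed */
        sleep(1);                  /* rounds up a little, as Jamie notes */
    }
    return NULL;
}

int start_time_keeper(void)
{
    pthread_t tid;

    /* Pin pages so a page fault cannot add to the staleness; ignore
     * failure (needs privileges or a permissive RLIMIT_MEMLOCK). */
    mlockall(MCL_CURRENT | MCL_FUTURE);
    cached_time = time(NULL);      /* valid before the thread runs */
    return pthread_create(&tid, NULL, time_keeper, NULL);
}
```

Code comparing `cached_time` against gettimeofday() or stat() timestamps can still observe the misordering described above; the sketch only bounds the staleness, it does not eliminate it.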
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 20:11 ` Jamie Lokier @ 2004-03-06 5:12 ` Jamie Lokier 2004-03-06 12:56 ` Magnus Naeslund(t) 2004-03-07 11:55 ` Ingo Molnar 1 sibling, 1 reply; 100+ messages in thread From: Jamie Lokier @ 2004-03-06 5:12 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel Jamie Lokier wrote: > Ingo Molnar wrote: > > if mysql in fact calls time() frequently, then it should rather start a > > worker thread that updates a global time variable every second. > > That has the same problem as discussed later in this thread with > vsyscall-time: the worker thread may not run immediately when it is woken, > and also setitimer() and select() round up the delay a little more > than expected, so sometimes the global time variable will be out of > date and misordered. > > I don't know if such delays are a problem for MySQL. I still don't know about MySQL, but I have just encountered some code of my own which does break if time() returns significantly out of date values. Any code which is structured like this will break: time_t timeout = time(0) + TIMEOUT_IN_SECONDS; do { /* Do some stuff which takes a little while. */ } while (time(0) <= timeout); It goes wrong when time() returns a value that is in the past, and then jumps forward to the correct time suddenly. The timeout of the above code is reduced by the size of that jump. If the jump is larger than TIMEOUT_IN_SECONDS, the timeout mechanism is defeated completely. That sort of code is a prime candidate for the method of using a worker thread updating a global variable, so it's really important to take care when using it. -- Jamie ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-06 5:12 ` Jamie Lokier @ 2004-03-06 12:56 ` Magnus Naeslund(t) 2004-03-06 13:13 ` Magnus Naeslund(t) 0 siblings, 1 reply; 100+ messages in thread From: Magnus Naeslund(t) @ 2004-03-06 12:56 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-kernel Jamie Lokier wrote: [snip] > > Any code which is structured like this will break: > > time_t timeout = time(0) + TIMEOUT_IN_SECONDS; > > do { > /* Do some stuff which takes a little while. */ > } while (time(0) <= timeout); > > It goes wrong when time() returns a value that is in the past, and > then jumps forward to the correct time suddenly. The timeout of the > above code is reduced by the size of that jump. If the jump is larger > than TIMEOUT_IN_SECONDS, the timeout mechanism is defeated completely. > > That sort of code is a prime candidate for the method of using a > worker thread updating a global variable, so it's really important to > take care when using it. > But isn't this kind of code a known buggy way of implementing timeouts? Shouldn't it be like: time_t x = time(0); do { ... } while (time(0) - x >= TIMEOUT_IN_SECONDS); Of course it can't handle times in the past, but it won't get easily hung with regards to leaps or wraparounds (if used with other functions). Regards Magnus ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-06 12:56 ` Magnus Naeslund(t) @ 2004-03-06 13:13 ` Magnus Naeslund(t) 0 siblings, 0 replies; 100+ messages in thread From: Magnus Naeslund(t) @ 2004-03-06 13:13 UTC (permalink / raw) To: Magnus Naeslund(t); +Cc: Jamie Lokier, linux-kernel Magnus Naeslund(t) wrote: > > But isn't this kind of code a known buggy way of implementing timeouts? > Shouldn't it be like: > > time_t x = time(0); > do { > ... > } while (time(0) - x >= TIMEOUT_IN_SECONDS); I meant: } while (time(0) - x < TIMEOUT_IN_SECONDS); Also if time_t is signed, that needs to be taken care of. Magnus - butterfingers ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 20:11 ` Jamie Lokier 2004-03-06 5:12 ` Jamie Lokier @ 2004-03-07 11:55 ` Ingo Molnar 1 sibling, 0 replies; 100+ messages in thread From: Ingo Molnar @ 2004-03-07 11:55 UTC (permalink / raw) To: Jamie Lokier Cc: Andrea Arcangeli, Peter Zaitsev, Andrew Morton, riel, mbligh, linux-kernel * Jamie Lokier <jamie@shareable.org> wrote: > Ingo Molnar wrote: > > if mysql in fact calls time() frequently, then it should rather start a > > worker thread that updates a global time variable every second. > > That has the same problem as discussed later in this thread with > vsyscall-time: the worker thread may not run immediately when it is woken, > and also setitimer() and select() round up the delay a little more > than expected, so sometimes the global time variable will be out of > date and misordered w.r.t. gettimeofday() and stat() results of > recently modified files. we don't have any guarantees wrt. the synchronization of the time() and the gettimeofday() clocks - irrespective of vsyscalls, do we? Ingo ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-05 15:02 ` Ingo Molnar [not found] ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel> 2004-03-05 20:11 ` Jamie Lokier @ 2004-03-07 6:50 ` Peter Zaitsev 2 siblings, 0 replies; 100+ messages in thread From: Peter Zaitsev @ 2004-03-07 6:50 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrea Arcangeli, Andrew Morton, riel, mbligh, linux-kernel On Fri, 2004-03-05 at 07:02, Ingo Molnar wrote: > * Andrea Arcangeli <andrea@suse.de> wrote: > > > I thought time() wouldn't be called more than 1 per second anyways, > > why would anyone call time more than 1 per second? > > if mysql in fact calls time() frequently, then it should rather start a > worker thread that updates a global time variable every second. Ingo, Andrea, I would not say MySQL calls time that often; it is normally 2 times per query (to measure query execution time), maybe a couple of times more. Looking at typical profiling results it takes much less than 1% of time, even for very simple query loads. Rather than changing design how time is computed I think we would better to go to better accuracy - nowadays 1 second is far too raw. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL) http://www.mysql.com/uc2004/ ^ permalink raw reply [flat|nested] 100+ messages in thread
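Sub-second timing of the kind Peter asks for can be built directly on gettimeofday(); the helper below is an illustration of measuring elapsed time at microsecond resolution, not MySQL's actual code:

```c
#include <sys/time.h>

/* Microsecond-resolution interval between two gettimeofday() readings.
 * The usec term may be negative mid-computation; the widened seconds
 * term absorbs it, so the result is correct either way. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (end->tv_usec - start->tv_usec);
}
```

Query timing would bracket the work with two gettimeofday() calls and subtract: one syscall at each end, which matches Peter's observation that the overhead is well under 1% even for trivial queries.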
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-28 6:10 ` Martin J. Bligh 2004-02-28 6:43 ` Andrea Arcangeli @ 2004-03-02 9:10 ` Kurt Garloff 2004-03-02 15:32 ` Martin J. Bligh 1 sibling, 1 reply; 100+ messages in thread From: Kurt Garloff @ 2004-03-02 9:10 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Linux kernel list On Fri, Feb 27, 2004 at 10:10:22PM -0800, Martin J. Bligh wrote: > Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on > 4:4 ??? You could use 3.7 on 4:4, but what's the point if you throw away the mapping constantly by flushing the TLB? Regards, -- Kurt Garloff <kurt@garloff.de> [Koeln, DE] Physics:Plasma modeling <garloff@plasimo.phys.tue.nl> [TU Eindhoven, NL] Linux: SUSE Labs (Head) <garloff@suse.de> [SUSE Nuernberg, DE] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-03-02 9:10 ` Kurt Garloff @ 2004-03-02 15:32 ` Martin J. Bligh 0 siblings, 0 replies; 100+ messages in thread From: Martin J. Bligh @ 2004-03-02 15:32 UTC (permalink / raw) To: Kurt Garloff; +Cc: Linux kernel list > On Fri, Feb 27, 2004 at 10:10:22PM -0800, Martin J. Bligh wrote: >> Why is it 2.7GB with both 3:1 and 4:4 ... surely it can get bigger on >> 4:4 ??? > > You could use 3.7 on 4:4, but what's the point if you throw away the > mapping constantly by flushing the TLB? Normally, a bigger shm segment = higher performance. Throwing the TLB away means lower performance. Depending on the workload, the tradeoff could work out either way ... the only thing I've seen so far from someone who has measured it was hints that 4/4 was faster in some situations ... we're trying to do some more runs to confirm / deny that. Hopefully others will do the same ;-) M. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-27 20:29 ` Andrew Morton 2004-02-27 20:49 ` Rik van Riel 2004-02-27 21:15 ` Andrea Arcangeli @ 2004-02-27 21:42 ` Hugh Dickins 2004-02-27 23:18 ` Marcelo Tosatti 3 siblings, 0 replies; 100+ messages in thread From: Hugh Dickins @ 2004-02-27 21:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, andrea, linux-kernel On Fri, 27 Feb 2004, Andrew Morton wrote: > > Apart from the search problem, my main gripe with objrmap is that it > creates different handling for file-backed and anonymous memory. And the > code which extends it to anonymous memory is complex and large. I challenge that: anobjrmap ventured into more files than you wanted to change at the time, but it was not complex, and removed more than it added. Hugh ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-27 20:29 ` Andrew Morton ` (2 preceding siblings ...) 2004-02-27 21:42 ` Hugh Dickins @ 2004-02-27 23:18 ` Marcelo Tosatti 2004-02-27 22:39 ` Andrew Morton 3 siblings, 1 reply; 100+ messages in thread From: Marcelo Tosatti @ 2004-02-27 23:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Rik van Riel, andrea, linux-kernel On Fri, 27 Feb 2004, Andrew Morton wrote: > Oh, and can we please have testcases? It's all very well to assert "it > sucks doing X and I fixed it" but it's a lot more useful if one can > distribute testcases as well so others can evaluate the fix and can explore > alternative solutions. > > Andrea, this shmem problem is a case in point, please. > > > > in small machines the current 2.4 stock algo works just fine too, it's > > > only when the lru has the million pages queued that without my new vm > > > algo you'll do million swapouts before freeing the memleak^Wcache. > > > > Same for Arjan's O(1) VM. For machines in the single and low > > double digit number of gigabytes of memory either would work > > similarly well ... > > Case in point. We went round the O(1) page reclaim loop a year ago and I > was never able to obtain a testcase which demonstrated the problem on 2.4, > let alone on 2.6. > > I had previously found some workloads in which the 2.4 VM collapsed for > similar reasons and those were fixed with the rotate_reclaimable_page() > logic. Without testcases we will not be able to verify that anything else > needs doing. Btw, Andrew, are your testcases online somewhere? I heard once someone was going to collect VM tests to make an "official testing package", but that has never happened AFAIK. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-27 23:18 ` Marcelo Tosatti @ 2004-02-27 22:39 ` Andrew Morton 0 siblings, 0 replies; 100+ messages in thread From: Andrew Morton @ 2004-02-27 22:39 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: riel, andrea, linux-kernel Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote: > > Andrew, are your testcases online somewhere? Well the tools are in ext3 CVS (http://www.zip.com.au/~akpm/linux/ext3/) but the issue is how to drive them to create a particular scenario. I never wrote that down, but there's heaps and heaps of info in the changelogs: http://linux.bkbits.net:8080/linux-2.5/user=akpm/ChangeSet?nav=!-|index.html|stats|!+|index.html 22000 lines of stuff there, so `bk revtool' and a fast computer may be a more convenient navigation system. That's not particularly useful, sorry. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-27 19:08 ` Rik van Riel 2004-02-27 20:29 ` Andrew Morton @ 2004-02-27 20:31 ` Andrea Arcangeli 2004-02-29 6:34 ` Mike Fedyk 2 siblings, 0 replies; 100+ messages in thread From: Andrea Arcangeli @ 2004-02-27 20:31 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel, Andrew Morton On Fri, Feb 27, 2004 at 02:08:28PM -0500, Rik van Riel wrote: > First, let me start with one simple request. Whatever you do, > please send changes upstream in small, manageable chunks so we > can merge your improvements without destabilising the kernel. > > We should avoid the kind of disaster we had around 2.4.10... 2.4.10 was a great success, the opposite of a disaster. > On Fri, 27 Feb 2004, Andrea Arcangeli wrote: > > > Then Andrew pointed out that there are complexity issues that objrmap > > can't handle but I'm not concerned about the complexity issue of objrmap > > since no real app will run into it > > We've heard the "no real app runs into it" argument before, > about various other subjects. I remember using it myself, > too, and every single time I used the "no real apps run into > it" argument I turned out to be wrong in the end. If this is an issue, then truncate() may be running into it already. So if you're worried about the vm usage you should worry about truncate usage first. > > then I can go further and add a dummy inode for anonymous mappings too > > during COW like DaveM did originally. Only then I can remove rmap > > entirely. This last step is somewhat lower prio. > > Moving to a full objrmap from the current pte-rmap could well > be a good thing from a code cleanliness perspective. 
> > I'm not particularly attached to rmap.c and won't be opposed > to a replacement, provided that the replacement is also more > or less modular with the VM so plugging in an even more > improved version in the future will be easy ;) objrmap itself should be self-contained; the patch from Martin's tree is quite small too. > > in small machines the current 2.4 stock algo works just fine too, it's > > only when the lru has the million pages queued that without my new vm > > algo you'll do million swapouts before freeing the memleak^Wcache. > > Same for Arjan's O(1) VM. For machines in the single and low > double digit number of gigabytes of memory either would work > similarly well ... I don't think they would work equally well. Btw, the vmstat I posted was from a 16G machine, where the copy was stuck because of inability to find 2G of clean cache. > > > Then again, your stuff will also find pages the moment they're > > > cleaned, just at the cost of a (little?) bit more CPU time. > > > > exactly, that's an important effect of my patch and that's the only > > thing that o1 vm is taking care of, I don't think it's enough since the > > gigs of cache would still be like a memleak without my code. > > ... however, if you have a hundred gigabyte of memory, or > even more, then you cannot afford to search the inactive > list for clean pages on swapout. It will end up using too > much CPU time. I'm using various techniques so that it doesn't scan a million pages in one go, and obviously I must start swapping before the very last clean cache is recycled. What I outlined is the concept. That is to "prioritize on clean cache"; "prioritize" doesn't mean "all and only clean cache first". But it's true I throw cpu at the work; there's no other way without more invasive changes, and the cpu load is not significant during swapping anyways, so it's not urgent to improve the vm further in 2.4. 
Just using an anchor to separate the clean scan from the dirty scan would improve things, but that as well is low priority. > The FreeBSD people found this out the hard way, even on > smaller systems... the last thing I would do is to take examples from FreeBSD or other Unix (not only for legal reasons). > > > Shouldn't be too critical, unless you've got more than maybe > > > a hundred GB of memory, which should be a year off. > > > > I think these effects start to be visible over 8G, the worst thing is > > that you can have 4G in a row of swapcache, in smaller systems the > > lru tends to be more intermixed. > > I've even seen the problem on small systems, where I used a > "smart" algorithm that freed the clean pages first and only > cleaned the dirty pages later. > > On my 128 MB desktop system everything was smooth, until > the point where the cache was gone and the system suddenly > faced an inactive list entirely filled with dirty pages. > > Because of this, we should do some (limited) pre-cleaning > of inactive pages. The key word here is "limited" ;) Correct, this is why it's not a full scan. I provide a sysctl to tune that, and it's called vm_cache_scan_ratio as I wrote in the original email. If it doesn't swap enough, increasing the sysctl will make it swap more. > > I think you mean he's using an anchor in the lru too in the same way I > > proposed here, but I doubt he's using it nearly as I would, there seems > > to be a fundamental difference in the two algorithms, with mine partly > > covering the work done by his, and not the other way around. > > An anchor in the lru list is definitely needed. Some > companies want to run Linux on systems with 256 GB or > more memory. In those systems the amount of CPU time > used to search the inactive list will become a problem, > unless we use a smartly placed anchor. > > Note that I wouldn't want to use the current O(1) VM > code on such a system, because the placement of the > anchor isn't quite smart enough ... 
this was my point about the o1 vm: getting the placement of the anchor right is very non-obvious, the idea itself sounds fine. > > Lets try combining your ideas and Arjan's ideas into > something that fixes all these problems. So here you agree they're different things. Not sure if my idea is the best for the long run either, but certainly it's needed in 2.4 to handle such load, and an equivalent solution (o1 vm is not enough IMO) will be needed in 2.6 as well. The basic idea behind my patch may be the right one for the long term though. However this is all low prio for 2.6 at the moment and I didn't even list this part in the roadmap, because first I need to avoid the rmap lockup before the machine starts swapping; then I can think about this. (as long as I keep swapoff -a this is not an issue) ^ permalink raw reply [flat|nested] 100+ messages in thread
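The anchor idea being debated — remember where the last scan stopped so the next pass does not rescan the whole inactive list from the head — can be illustrated with a toy list walk. This is purely a sketch of the concept, not the 2.4-aa or O(1) VM code, and the names are invented:

```c
#include <stddef.h>

/* Toy stand-in for a page on the inactive list. */
struct page_node {
    struct page_node *next;
    int dirty;              /* 1 = must be written out before reclaim */
};

/* Count up to 'want' clean (immediately reclaimable) pages starting at
 * *anchor, then advance the anchor past the scanned span.  The next
 * call resumes where this one stopped instead of rescanning from the
 * head -- the CPU-time saving Rik argues for on huge-memory boxes. */
int reclaim_clean(struct page_node **anchor, int want)
{
    int found = 0;
    struct page_node *p = *anchor;

    while (p && found < want) {
        if (!p->dirty)
            found++;        /* a real VM would free the page here */
        p = p->next;
    }
    *anchor = p;            /* O(1) restart point for the next scan */
    return found;
}
```

The hard part the thread identifies is not this mechanism but anchor *placement*: when pages are cleaned or re-dirtied behind the anchor, naively resuming from it misses them, which is why Andrea calls getting it right non-obvious.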
* Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end) 2004-02-27 19:08 ` Rik van Riel 2004-02-27 20:29 ` Andrew Morton 2004-02-27 20:31 ` Andrea Arcangeli @ 2004-02-29 6:34 ` Mike Fedyk 2 siblings, 0 replies; 100+ messages in thread From: Mike Fedyk @ 2004-02-29 6:34 UTC (permalink / raw) To: Rik van Riel; +Cc: Andrea Arcangeli, linux-kernel, Andrew Morton Rik van Riel wrote: > On Fri, 27 Feb 2004, Andrea Arcangeli wrote: >>>Then again, your stuff will also find pages the moment they're >>>cleaned, just at the cost of a (little?) bit more CPU time. >> >>exactly, that's an important effect of my patch and that's the only >>thing that o1 vm is taking care of, I don't think it's enough since the >>gigs of cache would still be like a memleak without my code. > > > ... however, if you have a hundred gigabyte of memory, or > even more, then you cannot afford to search the inactive > list for clean pages on swapout. It will end up using too > much CPU time. > > The FreeBSD people found this out the hard way, even on > smaller systems... So that's what the inact_clean list is for in 2.4-rmap. But your inactive lists are always much smaller than the active list on the smallish (< 1.5G) machines... ^ permalink raw reply [flat|nested] 100+ messages in thread
end of thread, other threads:[~2004-03-18 19:52 UTC | newest]
Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1u7eQ-6Bz-1@gated-at.bofh.it>
[not found] ` <1ue6M-45w-11@gated-at.bofh.it>
[not found] ` <1uofN-4Rh-25@gated-at.bofh.it>
[not found] ` <1vRz3-5p2-11@gated-at.bofh.it>
[not found] ` <1vRSn-5Fc-11@gated-at.bofh.it>
[not found] ` <1vS26-5On-21@gated-at.bofh.it>
[not found] ` <1wkUr-3QW-11@gated-at.bofh.it>
[not found] ` <1wolx-7ET-31@gated-at.bofh.it>
[not found] ` <1woEM-7Yx-41@gated-at.bofh.it>
[not found] ` <1wp8b-7x-3@gated-at.bofh.it>
[not found] ` <1wp8l-7x-25@gated-at.bofh.it>
[not found] ` <1x0qG-Dr-3@gated-at.bofh.it>
2004-03-12 21:15 ` 2.4.23aa2 (bugfixes and important VM improvements for the high end) Andi Kleen
2004-03-18 19:50 ` Peter Zaitsev
[not found] ` <1woEJ-7Yx-25@gated-at.bofh.it>
[not found] ` <1wp8c-7x-5@gated-at.bofh.it>
[not found] ` <1wprd-qI-21@gated-at.bofh.it>
[not found] ` <1wpUz-Tw-21@gated-at.bofh.it>
[not found] ` <1x293-2nT-7@gated-at.bofh.it>
2004-03-12 21:25 ` Andi Kleen
[not found] <20040304175821.GO4922@dualathlon.random>
2004-03-04 22:14 ` Rik van Riel
2004-03-04 23:24 ` Andrea Arcangeli
2004-03-05 3:43 ` Rik van Riel
2004-02-27 1:33 Andrea Arcangeli
2004-02-27 4:38 ` Rik van Riel
2004-02-27 17:32 ` Andrea Arcangeli
2004-02-27 19:08 ` Rik van Riel
2004-02-27 20:29 ` Andrew Morton
2004-02-27 20:49 ` Rik van Riel
2004-02-27 20:55 ` Andrew Morton
2004-02-27 21:28 ` Andrea Arcangeli
2004-02-27 21:37 ` Andrea Arcangeli
2004-02-28 3:22 ` Andrea Arcangeli
2004-03-01 11:10 ` Nikita Danilov
2004-02-27 21:15 ` Andrea Arcangeli
2004-02-27 22:03 ` Martin J. Bligh
2004-02-27 22:23 ` Andrew Morton
2004-02-28 2:32 ` Andrea Arcangeli
2004-02-28 4:57 ` Wim Coekaerts
2004-02-28 6:18 ` Andrea Arcangeli
2004-02-28 6:45 ` Martin J. Bligh
2004-02-28 7:05 ` Andrea Arcangeli
2004-02-28 9:19 ` Dave Hansen
2004-03-18 2:44 ` Andrea Arcangeli
[not found] ` <20040228061838.GO8834@dualathlon.random.suse.lists.linux.kernel>
2004-02-28 12:46 ` Andi Kleen
2004-02-29 1:39 ` Andrea Arcangeli
2004-02-29 2:29 ` Andi Kleen
2004-02-29 16:34 ` Andrea Arcangeli
2004-02-28 6:10 ` Martin J. Bligh
2004-02-28 6:43 ` Andrea Arcangeli
2004-02-28 7:00 ` Martin J. Bligh
2004-02-28 7:29 ` Andrea Arcangeli
2004-02-28 14:55 ` Rik van Riel
2004-02-28 15:06 ` Arjan van de Ven
2004-02-29 1:43 ` Andrea Arcangeli
[not found] ` <1078370073.3403.759.camel@abyss.local>
2004-03-04 3:14 ` Peter Zaitsev
2004-03-04 3:33 ` Andrew Morton
2004-03-04 3:44 ` Peter Zaitsev
2004-03-04 4:07 ` Andrew Morton
2004-03-04 4:44 ` Peter Zaitsev
2004-03-04 4:52 ` Andrea Arcangeli
2004-03-04 5:10 ` Andrew Morton
2004-03-04 5:27 ` Andrea Arcangeli
2004-03-04 5:38 ` Andrew Morton
2004-03-05 20:19 ` Jamie Lokier
2004-03-05 20:33 ` Andrea Arcangeli
2004-03-05 21:44 ` Jamie Lokier
2004-03-04 12:12 ` Rik van Riel
2004-03-04 16:21 ` Peter Zaitsev
2004-03-04 18:13 ` Andrea Arcangeli
2004-03-04 17:35 ` Martin J. Bligh
2004-03-04 18:16 ` Andrea Arcangeli
2004-03-04 19:31 ` Martin J. Bligh
2004-03-04 20:21 ` Peter Zaitsev
2004-03-05 10:33 ` Ingo Molnar
2004-03-05 14:15 ` Andrea Arcangeli
2004-03-05 14:32 ` Ingo Molnar
2004-03-05 14:58 ` Andrea Arcangeli
2004-03-05 15:26 ` Ingo Molnar
2004-03-05 15:53 ` Andrea Arcangeli
2004-03-07 8:41 ` Ingo Molnar
2004-03-07 10:29 ` Nick Piggin
2004-03-07 17:33 ` Andrea Arcangeli
2004-03-08 5:15 ` Nick Piggin
2004-03-07 17:24 ` Andrea Arcangeli
2004-03-05 21:28 ` Martin J. Bligh
2004-03-05 18:42 ` Martin J. Bligh
2004-03-05 19:13 ` Andrea Arcangeli
2004-03-05 19:55 ` Martin J. Bligh
2004-03-05 20:29 ` Andrea Arcangeli
2004-03-05 20:41 ` Andrew Morton
2004-03-05 21:07 ` Andrea Arcangeli
2004-03-05 22:12 ` Andrew Morton
2004-03-05 14:34 ` Ingo Molnar
2004-03-05 14:59 ` Andrea Arcangeli
2004-03-05 15:02 ` Ingo Molnar
[not found] ` <20040305150225.GA13237@elte.hu.suse.lists.linux.kernel>
2004-03-05 15:51 ` Andi Kleen
2004-03-05 16:23 ` Ingo Molnar
2004-03-05 16:39 ` Andrea Arcangeli
2004-03-07 8:16 ` Ingo Molnar
2004-03-10 13:21 ` Andi Kleen
2004-03-05 16:42 ` Andrea Arcangeli
2004-03-05 16:49 ` Ingo Molnar
2004-03-05 16:58 ` Andrea Arcangeli
2004-03-05 20:11 ` Jamie Lokier
2004-03-06 5:12 ` Jamie Lokier
2004-03-06 12:56 ` Magnus Naeslund(t)
2004-03-06 13:13 ` Magnus Naeslund(t)
2004-03-07 11:55 ` Ingo Molnar
2004-03-07 6:50 ` Peter Zaitsev
2004-03-02 9:10 ` Kurt Garloff
2004-03-02 15:32 ` Martin J. Bligh
2004-02-27 21:42 ` Hugh Dickins
2004-02-27 23:18 ` Marcelo Tosatti
2004-02-27 22:39 ` Andrew Morton
2004-02-27 20:31 ` Andrea Arcangeli
2004-02-29 6:34 ` Mike Fedyk
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox