Date: Fri, 27 Feb 2004 14:03:07 -0800
From: "Martin J. Bligh"
To: Andrea Arcangeli, Andrew Morton
cc: Rik van Riel, linux-kernel@vger.kernel.org
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)
Message-ID: <162060000.1077919387@flay>
In-Reply-To: <20040227211548.GI8834@dualathlon.random>
References: <20040227173250.GC8834@dualathlon.random>
 <20040227122936.4c1be1fd.akpm@osdl.org>
 <20040227211548.GI8834@dualathlon.random>

> note that the 4:4 split is wrong in 99% of cases where people need 64
> gigs. I'm advocating strongly for the 2:2 split to everybody I talk
> with; I'm trying to spread the 2:2 idea because IMHO it's an order of
> magnitude simpler and an order of magnitude superior. Unfortunately I
> could not get a single number to back my 2:2 claims, since the 4:4
> buzzword is spreading and people only test with 4:4, so it's pretty
> hard for me to spread the 2:2 buzzword.

For the record, I for one am not opposed to doing 2:2 instead of 4:4.
What pisses me off is people trying to squeeze large amounts of memory
into 3:1, and distros pretending it's supportable, when it's never
stable across a broad spectrum of workloads. Between 2:2 and 4:4, it's
just a different overhead tradeoff.

> 4:4 makes no sense at all, the only advantage of 4:4 w.r.t. 2:2 is
> that they can map 2.7G per task of shm instead of 1.7G per task of
> shm.

Eh? You have a 2GB difference of user address space, but only a 1GB
difference of shm size. You lost a GB somewhere ;-) Depending on
whether you move TASK_UNMAPPED_BASE or not, you might mean 2.7 vs 0.7,
or at a pinch 3.5 vs 1.5, I'm not sure.
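Back of the envelope (assuming the usual i386-style layout, with
TASK_UNMAPPED_BASE at 1G and a few hundred MB eaten by text, libs and
stack - treat the exact numbers as illustrative):

    4:4:  ~4G user - 1G below TASK_UNMAPPED_BASE - ~0.3G  =>  ~2.7G shm
    2:2:   2G user - 1G below TASK_UNMAPPED_BASE - ~0.3G  =>  ~0.7G shm

Pull TASK_UNMAPPED_BASE right down and you get roughly 3.5 vs 1.5
instead. Either way the gap stays around 2G, not 1G.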
> syscall and irq. I expect the databases will run an order of
> magnitude faster with _2:2_ in a 64G configuration, with _1.7G_ per
> process of shm mapped, instead of their 4:4 split with 2.7G (or more,
> up to 3.9 ;) mapped per task.

That may well be true for some workloads; I suspect it's slower for
others. One could call the tradeoff either way.

> I don't mind if 4:4 gets merged, but I recommend db vendors benchmark
> _2:2_ against 4:4 before remotely considering deploying 4:4 in
> production. Then of course let me know, since I haven't had the luck
> to get any numbers back and I've no access to any 64G box.

If you send me a *simple* simulation test, I'll gladly run it for you
;-) But I'm not going to go fiddle with Oracle and thousands of disks
;-)

> I don't care about 256G with the 2:2 split, since intel and hp are
> now going x86-64 too.

Yeah, I don't think we ever need to deal with that kind of insanity ;-)

>> averse to objrmap for file-backed mappings either - I agree that the
>> search problems which were demonstrated are unlikely to bite in real
>> life.
>
> cool.
>
> Martin's patch from IBM is a great start IMHO. I found a bug in the
> vma flags check though: VM_RESERVED should be checked too, not only
> VM_LOCKED, unless I'm missing something, but it's a minor issue.

I didn't actually write it - that was Dave McCracken ;-) I just
suggested the partial approach (because I'm dirty and lazy ;-)) and
carried it in my tree. I agree with Andrew's comments though - it's
not nice having the dual approach of the partial, but the complexity
of the full approach is a bit scary and buys you little in real terms
(performance and space). I still believe that creating an
"address_space-like structure" for anon memory, shared across VMAs, is
an idea that might give us cleaner code - it also fixes other problems
like Andi's NUMA API binding.

> We can write a testcase ourselves, it's pretty easy: just create a
> 2.7G file in /dev/shm, mmap(MAP_SHARED) it from 1k processes and
> fault in all the pagetables from all tasks touching the shm vma. Then
> run a second copy until the machine starts swapping and see how
> things go. To do this you probably need 8G, which is why I haven't
> written the testcase myself yet ;). Maybe I can simulate it with less
> shm and fewer tasks on 1G boxes too, but the extreme lru effects of
> point 3 won't be visible there; the very same software configuration
> works fine on 1/2G boxes on stock 2.4. Problems show up when the lru
> grows due to the algorithm not contemplating millions of dirty
> swapcache pages in a row at the end of the lru and some gigs of free
> cache at the head of the lru. The rmap-only issues can also be tested
> with math; no testcase is needed for that.

I don't have time to go write it at the moment, but I can certainly
run it on high-end hardware if that helps.

M.
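P.S. For concreteness, something along these lines is roughly the
testcase I understand you to be describing - a rough sketch only,
untested, with the tmpfs path, size and task count as placeholders
(the real run would want ~2.7G and ~1k tasks):

/*
 * Create a large tmpfs-backed file, mmap(MAP_SHARED) it from many
 * processes and touch one byte per page, so every task instantiates
 * full pagetables for the shared mapping.
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define SHM_PATH  "/dev/shm/ptetest"    /* assumes tmpfs on /dev/shm */
#define SHM_SIZE  (512UL * 1024 * 1024) /* scale up towards 2.7G     */
#define NTASKS    128                   /* scale up towards 1000     */

/* write one byte per page: forces a pte to be built in this task */
static void touch_all(char *map)
{
	unsigned long off;

	for (off = 0; off < SHM_SIZE; off += getpagesize())
		map[off] = 1;
}

int main(void)
{
	int fd, i;

	fd = open(SHM_PATH, O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, SHM_SIZE) < 0) {
		perror(SHM_PATH);
		return 1;
	}

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			char *map = mmap(NULL, SHM_SIZE,
					 PROT_READ | PROT_WRITE,
					 MAP_SHARED, fd, 0);
			if (map == MAP_FAILED) {
				perror("mmap");
				_exit(1);
			}
			touch_all(map);
			sleep(60);	/* keep the pagetables live while
					 * the other tasks build theirs */
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	unlink(SHM_PATH);
	return 0;
}

Run a second copy alongside it, as you suggest, to push the box into
swap.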