From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S266291AbUHMRaY (ORCPT ); Fri, 13 Aug 2004 13:30:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S266284AbUHMRaY (ORCPT ); Fri, 13 Aug 2004 13:30:24 -0400 Received: from omx2-ext.SGI.COM ([192.48.171.19]:27044 "EHLO omx2.sgi.com") by vger.kernel.org with ESMTP id S266291AbUHMR3X (ORCPT ); Fri, 13 Aug 2004 13:29:23 -0400 Message-ID: <411CFAE6.1010109@sgi.com> Date: Fri, 13 Aug 2004 12:31:18 -0500 From: Ray Bryant User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 X-Accept-Language: en-us, en MIME-Version: 1.0 Newsgroups: fa.linux.kernel To: "Martin J. Bligh" CC: Jesse Barnes , akpm@osdl.org, linux-kernel@vger.kernel.org, steiner@sgi.com Subject: Re: [PATCH] allocate page caches pages in round robin fasion References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Hi Martin, Martin J. Bligh wrote: > > Does that actually happen though? Looking at the current code makes me think > it'll keep some pages free on all nodes at all times, and if kswapd does > it's job, we'll never fall back across nodes. Now ... I think that's broken, > but I think that's what currently happens - that was what we discussed at > KS ... I might be misreading it though, I should test it. > > Even if that's not true, allocating all your most recent stuff off-node is > still crap (so either way, I'd agree the current situation is broken), but > I don't think the solution is to push ALL your accesses (with n-1/n probability) > off-node ... we need to be more careful than that ... > I think you're missing out on the typical workload situation where we run into this problem. Just to make things a bit more specific, lets assume we are on a 128 node (256 P) system, with 4 GB per node. Let's assume that we have a 100 GB data file that we access periodically during the run, accesses to that data file are done in random access fashion from each node. The program starts out by reading in the data file, then forks off 256 copies of itself, and allocates 1 GB per CPU of local storage via MPOL_DEFAULT. All of those pages had better be in local or the computation will be unbalanced and run as slowly as the slowest node. As I read the __alloc_pages() code, those 100 GB of data pages will be allocated on the node that did the file read; when that node fills up, we will spill the allocation to adjacent nodes (this is the first loop of __alloc_pages(), kswapd doesn't get invoked until that first loop fails). (kswapd() doesn't get invoked until all of the zones in the zonelist are full. All of memory is in that zonelist, unless we have cpusets enabled. So the priority is to spill off node first and then swap() second.) Now the application starts allocating its 2 GB of local, and the nodes where the page cache was allocated all get non-local pages allocated. (Once again, this happens in the first loop of __alloc_pages().) The ratio of accesses to local data pages versus access to remote page cache pages is unfavorable for local page cache allocation, since the page cache pages are accessed at a tiny fraction of the rate of the data pages. Now I suppose you could argue that the application should fork first and then read in 1/256th of the data on each cpu. The problems with this, in general, are twofold: (1) It could have been a simple "cp" in a startup script that did the read.. We can't fix all of those things as well. (2) The application may be an ISV's program that is not NUMA aware. We can fix most of that by wrappering the program with control scripts, but requiring the ISV to build a specific NUMA aware version of the binary for Altix is oftentimes not feasible. (And because the allocation policy is MPOL_DEFAULT, the application doesn't have to have NUMA API calls imbedded in the program.) > >>>If we round-robin it ... surely 7/8 of your data (on your 8 node machine) >>>will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is >>>needed is to punt old pages back off onto another node, rather than >>>swapping them out. That way all your pages are going to be local. >> Surely you can't be suggesting that I migrate a page cache page to a local node just to read it? If file accesses are random and global, won't you end up just bouncing page cache pages hither and yon? Surely it is better just to copy the data remotely to the current node and leave it where it is? (YMMV -- all of these tradeoffs are clearly workload dependent.) >>That gets complicated pretty quickly I think. We don't want to constantly >>shuffle pages between nodes with kswapd, and there's also the problem of >>deciding when to do it.