Date: Wed, 15 Dec 2004 05:08:54 +0100
From: Andi Kleen
To: Brent Casavant
Cc: "Martin J. Bligh", Andi Kleen, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-ia64@vger.kernel.org
Subject: Re: [PATCH 0/3] NUMA boot hash allocation interleaving
Message-ID: <20041215040854.GC27225@wotan.suse.de>
References: <9250000.1103050790@flay> <20041214191348.GA27225@wotan.suse.de> <19030000.1103054924@flay>

On Tue, Dec 14, 2004 at 05:24:02PM -0600, Brent Casavant wrote:
> On Tue, 14 Dec 2004, Martin J. Bligh wrote:
>
> > --On Tuesday, December 14, 2004 20:13:48 +0100 Andi Kleen wrote:
> >
> > > I originally was a bit worried about the TLB usage, but it doesn't
> > > seem to be a too big issue (hopefully the benchmarks weren't too
> > > micro though)
> >
> > Well, as long as we stripe on large page boundaries, it should be fine,
> > I'd think. On PPC64, it'll screw the SLB, but ... tough ;-) We can either
> > turn it off, or only do it on things larger than the segment size, and
> > just round-robin the rest, or allocate from node with most free.
>
> Is there a reasonably easy-to-use existing infrastructure to do this?

No. It would actually be a lot of work, requiring new code for each
architecture, and may even be impossible on some.
The current hugetlb code is not really suitable for this because it
requires a preallocated pool and only works for user space.

I actually considered implementing it for x86-64 some time ago for the
modules, but then I never bothered.

On AMD systems I actually prefer to use small pages here. The reason is
that Opteron has separate TLBs for large and small pages, and the small
page TLB is much bigger. When someone else uses huge TLB pages too (user
space or the kernel direct mapping), it's actually a good idea to use
small pages.

Also, it may be difficult in some cases to allocate such large pages even
at boot, and impossible to do it later when a module loads.

Also, at least on IA64 the large page size is usually 1-2GB, and that
seems a little too large to me for interleaving purposes. It may also
defeat the purpose you implemented it for: not using too much memory from
a single node.

Using other page sizes would probably be tricky because the Linux VM can
currently barely deal with two page sizes. I suspect handling more would
need some VM infrastructure effort, at least in the changed port.

> I didn't find anything in my examination of vmalloc itself, so I gave
> up on the idea.
>
> And just to clarify, are you saying you want to see this before inclusion
> in mainline kernels, or that it would be nice to have but not necessary?

I wouldn't do anything in this area unless somebody shows benchmark or
profiling results where TLB pressure makes a clear difference. And even
then it may not be worth the effort.

-Andi
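
The granularity point above (a 1-2GB page being too coarse for
interleaving) can be illustrated with some rough arithmetic. This is a
userspace sketch, not kernel code: the helper names, the 512MB table
size, the 8-node count, and the 16KB and 16MB page sizes are all assumed
values chosen only for illustration. It hands out pages round-robin over
the nodes and reports how much of the table ends up on the most loaded
node.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def pages_needed(table_bytes, page_bytes):
    """Pages of size page_bytes needed to back table_bytes (rounded up)."""
    return -(-table_bytes // page_bytes)


def pages_per_node(table_bytes, page_bytes, nodes):
    """Hand the pages out round-robin over the nodes, starting at node 0."""
    npages = pages_needed(table_bytes, page_bytes)
    base, extra = divmod(npages, nodes)
    return [base + (1 if n < extra else 0) for n in range(nodes)]


def busiest_node_share(table_bytes, page_bytes, nodes):
    """Fraction of the table's pages that land on the most loaded node."""
    counts = pages_per_node(table_bytes, page_bytes, nodes)
    return max(counts) / sum(counts)


TABLE = 512 * MB  # assumed hash table size
NODES = 8         # assumed number of NUMA nodes

if __name__ == "__main__":
    for page in (16 * KB, 16 * MB, 1 * GB):
        share = busiest_node_share(TABLE, page, NODES)
        print(f"page size {page:>10} bytes: "
              f"{pages_needed(TABLE, page):>6} pages, "
              f"busiest node holds {share:.1%}")
```

With these assumed numbers, the two smaller page sizes spread the table
evenly, while a single 1GB page puts the whole table on one node, which
is exactly the situation the interleaving was meant to avoid.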