Date: Tue, 15 Apr 2008 10:26:14 -0700
From: Nishanth Aravamudan
To: Andrew Morton
Cc: mel@csn.ul.ie, clameter@sgi.com, apw@shadowen.org,
	kosaki.motohiro@jp.fujitsu.com, linux-mm@kvack.org
Subject: Re: [PATCH] Smarter retry of costly-order allocations
Message-ID: <20080415172614.GE15840@us.ibm.com>
References: <20080411233500.GA19078@us.ibm.com>
	<20080411233553.GB19078@us.ibm.com>
	<20080415000745.9af1b269.akpm@linux-foundation.org>
In-Reply-To: <20080415000745.9af1b269.akpm@linux-foundation.org>

On 15.04.2008 [00:07:45 -0700], Andrew Morton wrote:
> On Fri, 11 Apr 2008 16:35:53 -0700 Nishanth Aravamudan wrote:
>
> > Because of page order checks in __alloc_pages(), hugepage (and
> > similarly large order) allocations will not retry unless explicitly
> > marked __GFP_REPEAT. However, the current retry logic is nearly an
> > infinite loop (or until reclaim makes no progress whatsoever). For
> > these costly allocations, that seems like overkill and could
> > potentially never terminate.
> >
> > Modify try_to_free_pages() to indicate how many pages were
> > reclaimed. Use that information in __alloc_pages() to eventually
> > fail a large __GFP_REPEAT allocation when we've reclaimed an order
> > of pages equal to or greater than the allocation's order. This
> > relies on lumpy reclaim functioning as advertised. Due to
> > fragmentation, lumpy reclaim may not be able to free up the order
> > needed in one invocation, so multiple iterations may be required.
> > In other words, the more fragmented memory is, the more retry
> > attempts __GFP_REPEAT will make (particularly for higher order
> > allocations).
>
> hm, there's rather a lot of speculation and wishful thinking in that
> changelog.

Sorry about that -- I realized after sending (and reading your other
e-mails on LKML/linux-mm about changelogs) that I should have referred
to Mel's previous testing results, at a minimum.

> If we put this through -mm and into mainline then nobody will test it
> and we won't discover whether it's good or bad until late -rc at best.
>
> So... would like to see some firmer-looking testing results, please.

Do Mel's e-mails cover this sufficiently?

> I _assume_ this patch was inspired by some observed problem? What was
> that problem, and what effect did the patch have?

To make it explicit, the problem is in the userspace interface for
growing the static hugepage pool (/proc/sys/vm/nr_hugepages). An
administrator may request 100 hugepages but, due to fragmentation,
load, etc. on the system, only be granted 60. The administrator could,
however, request 100 hugepages again and be granted 70 on the second
request. Then 72 on the third, 73 on the fourth, still 73 on the
fifth, then 74 on the sixth, etc. The numbers here are made up, but
similar patterns are observed in practice.
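(For concreteness, here is a rough stand-alone model of the check the
changelog above describes -- the helper name should_retry() and the
structure are mine, not the literal patch; the real logic sits in
__alloc_pages() and uses the reclaim count returned by
try_to_free_pages():)

	/*
	 * Illustrative model only: decide whether __alloc_pages()
	 * should loop back into reclaim for this allocation.
	 */
	#define PAGE_ALLOC_COSTLY_ORDER 3

	static int should_retry(unsigned int order, int gfp_repeat,
				unsigned long pages_reclaimed)
	{
		/* Small orders keep retrying, as they do today. */
		if (order <= PAGE_ALLOC_COSTLY_ORDER)
			return 1;
		/*
		 * Costly orders retry only if __GFP_REPEAT is set, and
		 * only until reclaim has freed at least 1 << order
		 * pages in total across the retry loop.
		 */
		if (gfp_repeat && pages_reclaimed < (1UL << order))
			return 1;
		return 0;
	}

With "1" as the implicit multiplier, the loop stops as soon as reclaim
has freed roughly one allocation's worth of pages; that trade-off is
what the discussion below is about.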
Rather than force user space to keep trying until some point (which
user space is not really in a position to observe, given patterns like
{72, 73, 73, 74}, i.e. no growth followed by growth), I think putting
the smarts in the kernel to leverage reclaim is a better approach. And
Mel's results indicate the /proc interface "performs" better (achieves
a larger number of hugepages on the first try) than before.

> And what scenarios might be damaged by this patch, and how do we test
> for them?

This is a good question -- Mel's testing does cover some of this by
verifying the reclaim path is not destroyed by the extra checks.
However, unless there is quite serious fragmentation, I think most of
the lower-order allocations (which are implicitly __GFP_NOFAIL)
succeed on one iteration through __alloc_pages() anyway. The impact on
lower-order allocations should just be the changed return value,
though, as we don't look at the reclaim success to determine whether
we should quit reclaiming in that case.

> The "repeat until we've reclaimed 1<<order pages" logic contains a
> magic number, and its value is "1". How did we arrive at this magic
> number and why isn't "2" a better one? Or "0.5"?

Another good question, and one that should have been answered in my
changelog, I'm sorry. We have some order of allocation to satisfy.
Relying on lumpy reclaim to attempt to free up lumps of memory, once
we have reclaimed an order of pages greater than or equal to the order
of the requested allocation, we should be able to satisfy the
allocation. If we can't at that point, we fail the allocation. I
believe this is a good balance between trying to satisfy large
allocations when possible and looping in the core VM forever.

"2" may be a better value in one sense, because we should be even more
likely to succeed the allocation if we've freed twice as many pages as
we needed, but we'd try longer at the tail end of the reclaim loop
(having gone through several iterations without getting a large enough
contiguous region free), even though we probably should have succeeded
earlier.

"0.5" won't work, I don't think, because that would imply reclaiming
half as many pages as the original request. Unless there were already
about half the number of pages free (but no more), the allocation
would fail early, even though it might succeed a few tries later.
More importantly, "1" subsumes the case where half the pages are free
now and we need to reclaim the other half -- as we'll succeed the
allocation at some point and stop reclaiming. Really, that's the same
reason that "2" would be better -- or really, __GFP_NOFAIL would be.
But given that hugepage orders are very large (and this is all above
PAGE_ALLOC_COSTLY_ORDER to begin with), I don't think we want them to
be NOFAIL.

Thanks,
Nish

-- 
Nishanth Aravamudan
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
Don't email: email@kvack.org