From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51])
	by kanga.kvack.org (Postfix) with SMTP id E0F786B008C
	for ; Tue, 26 Jan 2010 11:12:18 -0500 (EST)
Date: Tue, 26 Jan 2010 17:11:20 +0100
From: Andrea Arcangeli
Subject: Re: [PATCH 00 of 30] Transparent Hugepage support #3
Message-ID: <20100126161120.GN30452@random.random>
References: <20100122151947.GA3690@random.random>
	<20100123175847.GC6494@random.random>
	<20100125224643.GA30452@random.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Christoph Lameter
Cc: linux-mm@kvack.org, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Izik Eidus, Hugh Dickins, Nick Piggin, Rik van Riel, Mel Gorman,
	Andi Kleen, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Chris Wright, Andrew Morton
List-ID:

On Tue, Jan 26, 2010 at 09:47:51AM -0600, Christoph Lameter wrote:
> I have to disable swap to be able to make use of these huge pages?

No.

> Just because your configuration did not split does not mean that there
> is a guarantee of them not splitting. You need to guarantee that the VM
> does not split them in order to be able to safely refer to them from
> code (like I/O paths).

No. O_DIRECT already works on those pages without splitting them,
there is no need to split them, just run 512 gups like you would be
doing if those weren't hugepages.

If your I/O can be interrupted then just use mmu notifier, call
gup_fast, and be notified if anything runs that split the page.
Splitting the page doesn't mean relocating it, DMA won't be able to
notice. So if you use mmu notifier just 1 gup + put_page will be
enough exactly because with mmu notifier you won't need refcounting on
tail pages and head pages at all!

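For reference, the "512" above is simply the number of 4k pages that
fit in one 2M hugepage. A minimal userspace sketch of that arithmetic
follows, assuming x86-64 page sizes. The macro and helper names are
hypothetical, made up for illustration; none of this is kernel API and
nothing here calls gup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 sizes: 4k base pages and 2M hugepages. */
#define SMALL_PAGE_SIZE (4UL * 1024)
#define HUGE_PAGE_SIZE  (2UL * 1024 * 1024)

/* Number of 4k fragments (one gup each) covering one hugepage. */
static inline size_t fragments_per_hugepage(void)
{
	return HUGE_PAGE_SIZE / SMALL_PAGE_SIZE;
}

/* Round an address down to the start of its 2M-aligned hugepage. */
static inline uintptr_t hugepage_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(HUGE_PAGE_SIZE - 1);
}

/* Address of the idx-th 4k fragment of the hugepage containing addr. */
static inline uintptr_t fragment_addr(uintptr_t addr, size_t idx)
{
	assert(idx < fragments_per_hugepage());
	return hugepage_base(addr) + idx * SMALL_PAGE_SIZE;
}
```

In these terms, the "512 gups" case walks fragment_addr(addr, 0)
through fragment_addr(addr, 511), while the single-gup case only needs
the address returned by hugepage_base(addr).
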
If you don't have a longstanding mapping and a way to synchronously
interrupt the visibility of hugepages from your device, then likely
you work with small DMA sizes like storage and networking do, and
gup on each 4k will be fine.

> Earlier you stated that reclaim can remove 4k pieces of huge pages after a
> split. How does gup keep the huge pages stable while doing I/O? Does gup
> submit 512 pointers to 4k chunks or 1 pointer to a 2M chunk?

gup works as it does now: you just write code that works today on a
fragmented hugepage, and it'll still work. So you need to run 512
gup_fast to be sure all 4k fragments are stable. But if you can use
mmu notifier, just one gup_fast(&head_page), put_page(head_page) will
be enough after you're registered. I'm unsure exactly what you need to
do that won't be feasible with mmu notifier and 1 gup or 512 gup.

> This implementation seems to only address the TLB pressure issue
> but not the scaling issue that arises because we have to handle data in
> 4k chunks (512 4k pointers instead of one 2M pointer). Scaling is not
> addressed because complex fallback logic sabotages a basic benefit of
> huge pages.

Scaling is addressed for everything, including collapsing the hugepage
back after swapin if they're fragmented because of that. Furthermore
we want to remove split_huge_page from as many paths as possible, but
Rome wasn't built in a day. We need to stabilize and stress this code
now, then we include it, and extend it to tmpfs and pagecache.

Note a malloc(3G)+memset(3G) takes >5sec with lockdep without
transparent hugepage, or <2sec after "echo always >enabled". TLB
pressure is irrelevant in that workload, which spends all its time
allocating pages and clearing them through the kernel direct mapping.

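A malloc+memset test of this kind can be sketched roughly as below.
This is not the program that produced the numbers above, only a
minimal illustration of the pattern; the function name
time_malloc_memset is made up, and the timing method
(clock_gettime with CLOCK_MONOTONIC) is an assumption.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Allocate `bytes` with malloc() and touch all of it with memset(),
 * so every page is faulted in. Returns the elapsed wall-clock time in
 * seconds, or -1.0 if the allocation fails.
 */
static double time_malloc_memset(size_t bytes)
{
	struct timespec start, end;
	char *buf;

	assert(bytes > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	buf = malloc(bytes);
	if (!buf)
		return -1.0;
	/* Nonzero fill, so the pair cannot be turned into a lazy calloc(). */
	memset(buf, 1, bytes);
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* Read a byte back so the memset is not optimized away. */
	volatile char sink = buf[bytes - 1];
	(void)sink;

	free(buf);
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

On a 64-bit machine with enough RAM, calling
time_malloc_memset((size_t)3 << 30) once with the "enabled" knob set
to always and once without would give the two figures to compare.
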
Your idea that this is only taking care of TLB pressure is totally
wrong and I posted benchmarks already as proof (which become extreme
the moment you enable lockdep and all the little locks become more
costly, so avoiding 512 page faults and doing a single call to
alloc_pages(order=9) speeds up the workload by more than 100%).

> > performance and functionality than what my patch delivers already
> > (ok swapping will be a little more efficient if done through 2M I/O
> > but swap performance isn't so critical). Our objective is to over time
> > eliminate the need of split_huge_page. khugepaged will remain required
> Ok then establish some way to make these huge pages stable.

Again: register into mmu notifier, call gup_fast; put_page, and
you're done. 1 op, and just 3 cachelines for pgd, pud and pmd to get
to the page.

> That all depends on what you mean by guarantee I guess.

mmu notifier is a must if the mapping is longstanding or you'll lock
the ram. It's also a lot more efficient than doing 512 gup_fast, which
would achieve the same effect but is evil against the VM (it locks the
user virtual memory in ram) and requires 512 gup instead of just 1.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .