From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 376566B0044 for ; Sat, 19 Dec 2009 11:42:44 -0500 (EST) Date: Sat, 19 Dec 2009 17:41:43 +0100 From: Andrea Arcangeli Subject: Re: [PATCH 25 of 28] transparent hugepage core Message-ID: <20091219164143.GC29790@random.random> References: <4d96699c8fb89a4a22eb.1261076428@v2.random> <20091218200345.GH21194@csn.ul.ie> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20091218200345.GH21194@csn.ul.ie> Sender: owner-linux-mm@kvack.org To: Mel Gorman Cc: linux-mm@kvack.org, Marcelo Tosatti , Adam Litke , Avi Kivity , Izik Eidus , Hugh Dickins , Nick Piggin , Rik van Riel , Andi Kleen , Dave Hansen , Benjamin Herrenschmidt , Ingo Molnar , Mike Travis , KAMEZAWA Hiroyuki , Christoph Lameter , Chris Wright , Andrew Morton List-ID: > On Thu, Dec 17, 2009 at 07:00:28PM -0000, Andrea Arcangeli wrote: > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > new file mode 100644 > > --- /dev/null > > +++ b/include/linux/huge_mm.h > > @@ -0,0 +1,110 @@ > > +#ifndef _LINUX_HUGE_MM_H > > +#define _LINUX_HUGE_MM_H > > + > > +extern int do_huge_anonymous_page(struct mm_struct *mm, > > + struct vm_area_struct *vma, > > + unsigned long address, pmd_t *pmd, > > + unsigned int flags); > > +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > > + struct vm_area_struct *vma); > > +extern int do_huge_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > > + unsigned long address, pmd_t *pmd, > > + pmd_t orig_pmd); > > +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm); > On Fri, Dec 18, 2009 at 08:03:46PM +0000, Mel Gorman wrote: > The naming of "huge" might bite in the ass later if/when transparent > support is applied to multiple page sizes. Granted, it's not happening > any time soon. Granted ;). But why not huge? I think you just want to add "pmd" there maybe, like do_huge_pmd_anonymous_page and do_huge_pmd_wp_page. The other two already looks fine to me. Huge means it's part of the hugepage support so I would keep it, otherwise you'd need to like a name like get_pmd_pte (that is less intuitiv than get_pmd_huge_pte). > > +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm, > > + unsigned long addr, > > + pmd_t *pmd, > > + unsigned int flags); > > +extern int zap_pmd_trans_huge(struct mmu_gather *tlb, > > + struct vm_area_struct *vma, > > + pmd_t *pmd); > > + > > +enum transparent_hugepage_flag { > > + TRANSPARENT_HUGEPAGE_FLAG, > > + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, > > + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG, > > Defrag is misleading. Glancing through the rest of the patch, "try harder" > would be a more appropriate term because it uses __GFP_REPEAT. No. Yes, open source has the defect that some people has nothing better to do so they break visible kernel APIs in /sys /proc and marks them obsoleted in make menuconfig and they go fixup userland often with little practical gain but purely aesthetical purist reasons, so to try to avoid that I just tried to make a visible kernel API that has a chance to survive 1 year of development without breaking userland. That means calling it "defrag" because "defrag" eventually will happen. Right now the best approximation is __GFP_REPEAT, so be it, but the visible kernel API must be done in a way that isn't tied to current internal implementation or cleverness of defrag. So please help in fighting the constant API breakage in /sys and those OBSOLETE marks in menuconfig (you may disable it if your userland is uptodate, etc...). In fact I ask you to review from the entirely opposite side, so thinking more long term. Still trying not to overdesign though. > > > diff --git a/mm/Makefile b/mm/Makefile > > --- a/mm/Makefile > > +++ b/mm/Makefile > > @@ -40,3 +40,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f > > obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o > > obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o > > obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o > > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > new file mode 100644 > > --- /dev/null > > +++ b/mm/huge_memory.c > > Similar on naming. Later someone will get congused as to why there is > hugetlbfs and huge_memory. Why? Unless you want to change HUGETLBFS name too and remove HUGE from there too, there's absolutely no reason to remove the huge name from transparent hugepages. In fact to the contrary I used HUGE exactly because that is how hugetlbfs call them! Otherwise I would have used transparent largepages. Whatever hugetlbfs uses, transparent hugepage also has to use that naming. Otherwise it's a mess. They're indentical features, one is transparent allocated, the other is not and requires an pseudo-fs to allocate them. The result, performance, and pagetable layout generated is identical too. I've no clue what confusion you're worried about here, I didn't call this hugetlbfs. This is transparent_hugepage, and that seems really strightforwad and obvious what it means (same thing, one through fs, the other transparent). You can argue I should have called it transparent_hugetlb! That one we can argue about, but arguing about the "huge" doesn't make sense to me. If you want to rename this to transparent_hugetlb and I'll do it. So it's even more clear the only difference between hugetlbfs and transparent_hugetlb. But personally I think hugetlb is not the appropriate name only for one reason: later we may want to use hugepages on pagecache too, but those pagecache will be hugepages, but not mapped by any tlb if they're the result of a read/write. In fact this is true for hugetlbfs too when you read/write, no tlb involvement at all! Which is why hugetlbfs should also renamed to hugepagefs if something! > > +static int __do_huge_anonymous_page(struct mm_struct *mm, > > + struct vm_area_struct *vma, > > + unsigned long address, pmd_t *pmd, > > + struct page *page, > > Maybe this should be do_pmd_anonymous page and match what do_anonymous_page > does as much as possible. This might offset any future problems related to > transparently handling pages at other page table levels. Why not do_huge_pmd_anonymous_page. This is an huge pmd after all and as said above removing huge from all places it's going to screw over some funtions that are specific for huge pmd and not for regular pmd. I'll make this change, it's fine with me. > > + unsigned long haddr) > > +{ > > + int ret = 0; > > + pgtable_t pgtable; > > + > > + VM_BUG_ON(!PageCompound(page)); > > + pgtable = pte_alloc_one(mm, address); > > + if (unlikely(!pgtable)) { > > + put_page(page); > > + return VM_FAULT_OOM; > > + } > > + > > + clear_huge_page(page, haddr, HPAGE_NR); > > + > > Ideally insead of defining things like HPAGE_NR, the existing functions for > multiple huge page sizes would be extended to return the "huge page size > corresponding to a PMD". You mean PMD_SIZE? Again this is the whole discussion if HPAGE should be nuked as a whole in favour of PMD_something. I'm unsure if favouring the PMD/PUD nomenclature is the way to go, considering the main complaint one can have is for archs that may have mixed page size that isn't a match of PMD/PUD at all! I'm open to suggestions just worrying about huge PUD seems not realistic, while a mixed page size that won't ever match pmd or pud is more realistic. power can't do it as it can't fallback, but maybe ia64 or others can do, I don't know. Surely anything realistic won't match PUD this is my main reason for disliking binding the whole patch to pmd sizes like if PUD sizes would be relevant. > > + __SetPageUptodate(page); > > + smp_wmb(); > > + > > Need to explain why smp_wmb() is needed there. It doesn't look like > you're protecting the bit set itself. More likely you are making sure > the writes in clear_huge_page() have finished but that's a guess. > Comment. Yes. Same as __pte_alloc. You're not the first asking this, I'll add comment. > > + spin_lock(&mm->page_table_lock); > > + if (unlikely(!pmd_none(*pmd))) { > > + put_page(page); > > + pte_free(mm, pgtable); > > Racing fault already filled in the PTE? If so, comment please. Again, > matching how do_anonymous_page() does a similar job would help > comprehension. Yes racing thread already mapped in a pmd large (or a pte if hugepage allocation failed). Adding comment... ;) > > + if (haddr >= vma->vm_start && haddr + HPAGE_SIZE <= vma->vm_end) { > > + if (unlikely(anon_vma_prepare(vma))) > > + return VM_FAULT_OOM; > > + page = alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_COMP| > > + (transparent_hugepage_defrag(vma) ? > > GFP_HIGHUSER_MOVABLE should only be used if hugepages_treat_as_movable > is set in /proc/sys/vm. This should be GFP_HIGHUSER only. Why? Either we move htlb_alloc_mask into common code so that it exists even when HUGETLBFS=n (like I had to do to share the copy_huge routines to share as much as possible with hugetlbfs), or this should remain movable to avoid crippling down the feature. hugepages_treat_as_movable right now only applies to hugetlbfs. We've only to decide if to apply it to transparent hugepages too. > > + if (transparent_hugepage_debug_cow() && new_page) { > > + put_page(new_page); > > + new_page = NULL; > > + } > > + if (unlikely(!new_page)) { > > This entire block needs be in a demote_pmd_page() or something similar. > It's on the hefty side for being in the main function. That said, I > didn't spot anything wrong in there either. Yeah this is a cleanup I should do but it's not as easy as it looks or I would have done it already when Adam asked me a few weeks ago. > > + } > > + } > > + > > + spin_lock(&mm->page_table_lock); > > + if (unlikely(!pmd_same(*pmd, orig_pmd))) > > + goto out_free_pages; > > + else > > + get_page(page); > > + spin_unlock(&mm->page_table_lock); > > + > > + might_sleep(); > > Is this check really necessary? We could already go alseep easier when > allocating pages. Ok, removed might_sleep(). > > > + for (i = 0; i < HPAGE_NR; i++) { > > + copy_user_highpage(pages[i], page + i, > > More nasty naming there. Needs to be cleared that pages is your demoted > base pages and page is the existing compound page. what exactly is not clear? You already asked to move this code into a separate function. What else to document this is the "fallback" copy to 4k pages? renaming pages[] to 4k_pages[] doesn't seem necessary to me, besides copy_user_highpage work on PAGE_SIZE not HPAGE_SIZE. > > + pmd_t _pmd; > > + /* > > + * We should set the dirty bit only for FOLL_WRITE but > > + * for now the dirty bit in the pmd is meaningless. > > + * And if the dirty bit will become meaningful and > > + * we'll only set it with FOLL_WRITE, an atomic > > + * set_bit will be required on the pmd to set the > > + * young bit, instead of the current set_pmd_at. > > + */ > > + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd)); > > + set_pmd_at(mm, addr & HPAGE_MASK, pmd, _pmd); > > + } > > + page += (addr & ~HPAGE_MASK) >> PAGE_SHIFT; > > More HPAGE vs PMD here. All of them or none, not sure why you mention it on the MASK, maybe it's just an accident. Every single HPAGE_SIZE has to be changed too! Not just HPAGE_MASK, or it's pointless. > > +static int __split_huge_page_splitting(struct page *page, > > + struct vm_area_struct *vma, > > + unsigned long address) > > +{ > > + struct mm_struct *mm = vma->vm_mm; > > + pmd_t *pmd; > > + int ret = 0; > > + > > + spin_lock(&mm->page_table_lock); > > + pmd = page_check_address_pmd(page, mm, address, > > + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG); > > + if (pmd) { > > + /* > > + * We can't temporarily set the pmd to null in order > > + * to split it, pmd_huge must remain on at all times. > > + */ > > Why, to avoid a double fault? Or to avoid a case where the huge page is > being split, another fault occurs and zero-filled pages get faulted in? Well initially I did pmdp_clear_flush and overwritten it. It was a nasty race to find and fix, wasted some time on it. Yes once the pmd is zero, anything can happen, it's like a page not faulted in yet, and nobody will take the slow path of pmd_huge anymore to serialize against the split_huge_page with anon_vma->lock. > I'm afraid I ran out of time at this point. It'll be after the holidays > before I get time for a proper go at it. Sorry. Understood. The main trouble I see in your comments is the pmd vs huge name. Please consider what I mentioned above about more realistic different hpage sizes that won't match either pmd/pud. And the pud_size being unusable until we get higher orders of magnitude of ram sizes. Then decide if to change every single HPAGE to PMD or to stick with this. I'm personally netural, I _never_ care about names, I only care about what assembly gcc produces. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org