* Hugetlbpages in very large memory machines.......
@ 2004-03-13 3:44 Ray Bryant
2004-03-13 3:48 ` Andi Kleen
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Ray Bryant @ 2004-03-13 3:44 UTC (permalink / raw)
To: lse-tech, linux-ia64@vger.kernel.org, linux-kernel
We've run into a scaling problem using hugetlb pages on very large memory machines, e.g. machines
with 1 TB or more of main memory. The problem is that hugetlb pages are not faulted in; instead they
are zeroed and mapped in by hugetlb_prefault() (at least on ia64), which is called in response to
the user's mmap() request. The net result is that all of the hugetlb pages end up being allocated
and zeroed by a single thread, and if most of the machine's memory is allocated to hugetlb pages
and there is 1 TB or more of main memory, zeroing and allocating all of those pages can take a long
time (500 s or more).
We've looked at allocating and zeroing hugetlb pages at fault time, which would at least allow
multiple processors to be thrown at the problem. The question is: has anyone else been working on
this problem, and might they have prototype code they could share with us?
Thanks,
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
@ 2004-03-13  3:48 ` Andi Kleen
  2004-03-13  5:49   ` William Lee Irwin III
  2004-03-14  2:45   ` Andrew Morton
  2004-03-13  3:55 ` William Lee Irwin III
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2004-03-13  3:48 UTC (permalink / raw)
  To: Ray Bryant; +Cc: lse-tech, linux-ia64@vger.kernel.org, linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory. The
> problem is that hugetlb pages are not faulted in; instead they are
> zeroed and mapped in by hugetlb_prefault() (at least on ia64), which
> is called in response to the user's mmap() request. The net result is
> that all of the hugetlb pages end up being allocated and zeroed by a
> single thread, and if most of the machine's memory is allocated to
> hugetlb pages and there is 1 TB or more of main memory, zeroing and
> allocating all of those pages can take a long time (500 s or more).
>
> We've looked at allocating and zeroing hugetlb pages at fault time,
> which would at least allow multiple processors to be thrown at the
> problem. The question is: has anyone else been working on this
> problem, and might they have prototype code they could share with us?

Yes. I ran into exactly this problem with the NUMA API too. mbind()
runs after mmap(), but it cannot work any more when the pages are
already allocated.

I fixed it on x86-64/i386 by allocating the pages lazily. Doing it for
IA64 has been on the todo list too.

The i386/x86-64 code is attached as an example.

One drawback is that the out-of-memory handling is a lot less nice
than it was before - when you run out of hugepages you now get SIGBUS
instead of ENOMEM from mmap(). Maybe some prereservation would make
sense, but that would be somewhat harder. Alternatively, fall back to
smaller pages if possible (I was told that isn't easily possible on
IA64).

-Andi

diff -burpN -X ../KDIFX linux-2.6.2/arch/i386/mm/hugetlbpage.c linux-2.6.2-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.2/arch/i386/mm/hugetlbpage.c	2004-02-24 20:48:10.000000000 +0100
+++ linux-2.6.2-numa/arch/i386/mm/hugetlbpage.c	2004-02-20 18:52:57.000000000 +0100
@@ -329,41 +333,43 @@ zap_hugepage_range(struct vm_area_struct
 	spin_unlock(&mm->page_table_lock);
 }

-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+/* page_table_lock held on entry. */
+static int
+hugetlb_alloc_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long addr, int write_access)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 	unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
+	int ret;
+	pte_t *pte;
+	struct page *page = NULL;
+	struct address_space *mapping = vma->vm_file->f_mapping;

+	pte = huge_pte_alloc(mm, addr);
 	if (!pte) {
-		ret = -ENOMEM;
+		ret = VM_FAULT_OOM;
 		goto out;
 	}

-		if (!pte_none(*pte))
-			continue;
-
 	idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
 		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
 	page = find_get_page(mapping, idx);
 	if (!page) {
-		/* charge the fs quota first */
-		if (hugetlb_get_quota(mapping)) {
-			ret = -ENOMEM;
+		/* Should do this at prefault time, but that gets us into
+		   trouble with freeing right now. */
+		ret = hugetlb_get_quota(mapping);
+		if (ret) {
+			ret = VM_FAULT_OOM;
 			goto out;
 		}
-		page = alloc_hugetlb_page();
+
+		page = alloc_hugetlb_page(vma);
 		if (!page) {
 			hugetlb_put_quota(mapping);
-			ret = -ENOMEM;
+
+			/* Instead of OOMing here could just transparently use
+			   small pages. */
+
+			ret = VM_FAULT_OOM;
 			goto out;
 		}
 		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
@@ -371,23 +377,62 @@ int hugetlb_prefault(struct address_spac
 		if (ret) {
 			hugetlb_put_quota(mapping);
 			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-	}
+		ret = VM_FAULT_MAJOR;
+	} else
+		ret = VM_FAULT_MINOR;
+
 	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
+
+	/* Don't need to flush other CPUs. They will just do a page
+	   fault and flush it lazily. */
+	__flush_tlb_one(addr);
+
+ out:
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }

+int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long address, int write_access)
+{
+	pmd_t *pmd;
+	pgd_t *pgd;
+
+	if (write_access && !(vma->vm_flags & VM_WRITE))
+		return VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access);
+
+	pmd = pmd_offset(pgd, address);
+	if (pmd_none(*pmd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access);
+
+	BUG_ON(!pmd_large(*pmd));
+
+	/* must have been a race. Flush the TLB. NX not supported yet. */
+
+	__flush_tlb_one(address);
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR;
+}
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 static void update_and_free_page(struct page *page)
 {
 	int j;
 	struct page *map;

 	map = page;
 	htlbzone_pages--;
 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
 		map->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
diff -burpN -X ../KDIFX linux-2.6.2/mm/memory.c linux-2.6.2-numa/mm/memory.c
--- linux-2.6.2/mm/memory.c	2004-02-20 18:31:32.000000000 +0100
+++ linux-2.6.2-numa/mm/memory.c	2004-02-18 20:08:40.000000000 +0100
@@ -1576,6 +1593,15 @@ static inline int handle_pte_fault(struc
 	return VM_FAULT_MINOR;
 }

+
+/* Can be overridden by the architecture */
+int __attribute__((weak)) arch_hugetlb_fault(struct mm_struct *mm,
+					     struct vm_area_struct *vma,
+					     unsigned long address, int write_access)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  */
@@ -1591,7 +1617,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);

 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return arch_hugetlb_fault(mm, vma, address, write_access);

 	/*
 	 * We need the page table lock to synchronize with kswapd
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:48 ` Andi Kleen
@ 2004-03-13  5:49   ` William Lee Irwin III
  2004-03-14  2:45   ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2004-03-13  5:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Bryant, lse-tech, linux-ia64@vger.kernel.org, linux-kernel

On Sat, Mar 13, 2004 at 04:48:40AM +0100, Andi Kleen wrote:
> One drawback is that the out-of-memory handling is a lot less nice
> than it was before - when you run out of hugepages you now get SIGBUS
> instead of ENOMEM from mmap(). Maybe some prereservation would make
> sense, but that would be somewhat harder. Alternatively, fall back to
> smaller pages if possible (I was told that isn't easily possible on
> IA64).

That's not entirely true; whether it's feasible depends on how the MMU
is used. The HPW (hardware pagetable walker) and the short format of
the VHPT insist upon page size being a per-region attribute, where
regions are something like 60-bit areas of virtual space - that is
likely what they were referring to. The VHPT in long format should be
capable of arbitrary virtual placement (modulo alignment, of course).

-- wli
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:48 ` Andi Kleen
  2004-03-13  5:49   ` William Lee Irwin III
@ 2004-03-14  2:45   ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2004-03-14  2:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: raybry, lse-tech, linux-ia64, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> > We've looked at allocating and zeroing hugetlb pages at fault time,
> > which would at least allow multiple processors to be thrown at the
> > problem. The question is: has anyone else been working on this
> > problem, and might they have prototype code they could share with us?
>
> Yes. I ran into exactly this problem with the NUMA API too. mbind()
> runs after mmap(), but it cannot work any more when the pages are
> already allocated.
>
> I fixed it on x86-64/i386 by allocating the pages lazily. Doing it
> for IA64 has been on the todo list too.
>
> The i386/x86-64 code is attached as an example.
>
> One drawback is that the out-of-memory handling is a lot less nice
> than it was before - when you run out of hugepages you now get SIGBUS
> instead of ENOMEM from mmap(). Maybe some prereservation would make
> sense, but that would be somewhat harder. Alternatively, fall back to
> smaller pages if possible (I was told that isn't easily possible on
> IA64).

Demand-paging the hugepages is a decent feature to have, and ISTR
resisting it before for this reason.

Even though it's early in the 2.6 series, I'd be a bit worried about
breaking existing hugetlb users in this way. Yes, the pages are
preallocated, so it is unlikely that a working setup will suddenly
break - unless someone is using the return value from mmap() to find
out how many pages they can get.

So ho-hum. I think it needs to be back-compatible. Could we add
MAP_NO_PREFAULT?
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
  2004-03-13  3:48 ` Andi Kleen
@ 2004-03-13  3:55 ` William Lee Irwin III
  2004-03-13  4:56 ` Hirokazu Takahashi
  2004-03-15 15:28 ` jlnance
  3 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2004-03-13  3:55 UTC (permalink / raw)
  To: Ray Bryant; +Cc: lse-tech, linux-ia64@vger.kernel.org, linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory. The
> problem is that hugetlb pages are not faulted in; instead they are
> zeroed and mapped in by hugetlb_prefault() (at least on ia64), which
> is called in response to the user's mmap() request. The net result is
> that all of the hugetlb pages end up being allocated and zeroed by a
> single thread, and if most of the machine's memory is allocated to
> hugetlb pages and there is 1 TB or more of main memory, zeroing and
> allocating all of those pages can take a long time (500 s or more).
> We've looked at allocating and zeroing hugetlb pages at fault time,
> which would at least allow multiple processors to be thrown at the
> problem. The question is: has anyone else been working on this
> problem, and might they have prototype code they could share with us?

This actually is largely a question of architecture-dependent code, so
the answer will depend on whether your architecture matches those of
the others who have had a need to arrange this. Basically, all you
really need to do is check the vma and call either a hugetlb-specific
fault handler or handle_mm_fault(), depending on whether hugetlb is
configured. Once you've gotten that far, it's only a question of
implementing the methods to work together properly when driven by the
upper layers.

The reason why this wasn't done up front was that there wasn't a
demonstrable need to do so. The issue you're citing is exactly the
kind of demonstration needed to motivate its inclusion.

-- wli
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
  2004-03-13  3:48 ` Andi Kleen
  2004-03-13  3:55 ` William Lee Irwin III
@ 2004-03-13  4:56 ` Hirokazu Takahashi
  2004-03-16  0:30   ` Nobuhiko Yoshida
  2004-03-15 15:28 ` jlnance
  3 siblings, 1 reply; 14+ messages in thread
From: Hirokazu Takahashi @ 2004-03-13  4:56 UTC (permalink / raw)
  To: raybry; +Cc: lse-tech, linux-ia64, linux-kernel, n-yoshida

Hello,

My following patch might help you. It includes a page-fault routine
for hugetlb pages. If you want to use it for your purpose, you need to
remove some code from hugetlb_prefault() so that the fault path calls
hugetlb_fault() instead.

  http://people.valinux.co.jp/~taka/patches/va01-hugepagefault.patch

But it's just for IA32. I heard that n-yoshida@pst.fujitsu.com was
porting this patch to IA64.

> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory. The
> problem is that hugetlb pages are not faulted in; instead they are
> zeroed and mapped in by hugetlb_prefault() (at least on ia64), which
> is called in response to the user's mmap() request. The net result is
> that all of the hugetlb pages end up being allocated and zeroed by a
> single thread, and if most of the machine's memory is allocated to
> hugetlb pages and there is 1 TB or more of main memory, zeroing and
> allocating all of those pages can take a long time (500 s or more).
>
> We've looked at allocating and zeroing hugetlb pages at fault time,
> which would at least allow multiple processors to be thrown at the
> problem. The question is: has anyone else been working on this
> problem, and might they have prototype code they could share with us?
>
> Thanks,
> --
> Best Regards,
> Ray

Thank you,
Hirokazu Takahashi.
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  4:56 ` Hirokazu Takahashi
@ 2004-03-16  0:30   ` Nobuhiko Yoshida
  2004-03-16  1:54     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Nobuhiko Yoshida @ 2004-03-16  0:30 UTC (permalink / raw)
  To: raybry, linux-kernel; +Cc: lse-tech, linux-ia64, Hirokazu Takahashi

Hello,

Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> Hello,
>
> My following patch might help you. It includes a page-fault routine
> for hugetlb pages. If you want to use it for your purpose, you need to
> remove some code from hugetlb_prefault() so that the fault path calls
> hugetlb_fault() instead.
>
>   http://people.valinux.co.jp/~taka/patches/va01-hugepagefault.patch
>
> But it's just for IA32.
>
> I heard that n-yoshida@pst.fujitsu.com was porting this patch to IA64.

Below is my port of Takahashi-san's patch to IA64. However, my patch is
for kernel 2.6.0 and cannot be applied to 2.6.1 or later.

Thank you,
Nobuhiko Yoshida

diff -dupr linux-2.6.0.org/arch/ia64/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.0.org/arch/ia64/mm/hugetlbpage.c	2003-12-18 11:58:56.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c	2004-01-06 14:26:53.000000000 +0900
@@ -170,8 +170,10 @@ int copy_hugetlb_page_range(struct mm_st
 		goto nomem;
 	src_pte = huge_pte_offset(src, addr);
 	entry = *src_pte;
-	ptepage = pte_page(entry);
-	get_page(ptepage);
+	if (!pte_none(entry)) {
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+	}
 	set_pte(dst_pte, entry);
 	dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 	addr += HPAGE_SIZE;
@@ -195,6 +197,12 @@ follow_hugetlb_page(struct mm_struct *mm
 	do {
 		pstart = start & HPAGE_MASK;
 		ptep = huge_pte_offset(mm, start);
+
+		if (!ptep || pte_none(*ptep)) {
+			hugetlb_fault(mm, vma, 0, start);
+			ptep = huge_pte_offset(mm, start);
+		}
+
 		pte = *ptep;

 back1:
@@ -236,6 +244,12 @@ struct page *follow_huge_addr(struct mm_
 	pte_t *ptep;

 	ptep = huge_pte_offset(mm, addr);
+
+	if (!ptep || pte_none(*ptep)) {
+		hugetlb_fault(mm, vma, 0, addr);
+		ptep = huge_pte_offset(mm, addr);
+	}
+
 	page = pte_page(*ptep);
 	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
 	get_page(page);
@@ -246,7 +260,8 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	return NULL;
 }
@@ -518,6 +533,48 @@ int is_hugepage_mem_enough(size_t size)
 	return 1;
 }

+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte;
+	int ret = VM_FAULT_MINOR;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	page = find_get_page(mapping, idx);
+
+	if (!page) {
+		page = alloc_hugetlb_page();
+		if (!page) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		unlock_page(page);
+		if (ret) {
+			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+	}
+	pte = huge_pte_alloc(mm, address);
+	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+/*	update_mmu_cache(vma, address, *pte); */
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
 {
 	BUG();
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  0:30 ` Nobuhiko Yoshida
@ 2004-03-16  1:54   ` Andi Kleen
  2004-03-16  2:32     ` Hirokazu Takahashi
  2004-03-16  3:15     ` Nobuhiko Yoshida
  0 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2004-03-16  1:54 UTC (permalink / raw)
  To: Nobuhiko Yoshida
  Cc: raybry, linux-kernel, lse-tech, linux-ia64, Hirokazu Takahashi

> +	pte = huge_pte_alloc(mm, address);
> +	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);

This looks broken. Another CPU could have raced to the same fault and
already added a PTE here. You have to handle that.

(my i386 version originally had the same problem)

> +/*	update_mmu_cache(vma, address, *pte); */

I have not studied the low-level IA64 VM in detail, but don't you need
some kind of TLB flush here?

-Andi
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  1:54 ` Andi Kleen
@ 2004-03-16  2:32   ` Hirokazu Takahashi
  2004-03-16  3:20     ` Hirokazu Takahashi
  1 sibling, 1 reply; 14+ messages in thread
From: Hirokazu Takahashi @ 2004-03-16  2:32 UTC (permalink / raw)
  To: ak; +Cc: n-yoshida, raybry, linux-kernel, lse-tech, linux-ia64

Hello,

> > +	pte = huge_pte_alloc(mm, address);
> > +	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
>
> This looks broken. Another CPU could have raced to the same fault and
> already added a PTE here. You have to handle that.
>
> (my i386 version originally had the same problem)

Yes, you are right. In the fault handler, we should use
find_lock_page() instead of find_get_page() to find the hugepage
associated with the faulting address. After that, pte_none(*pte)
should be checked again to see whether a race has happened.

> > +/*	update_mmu_cache(vma, address, *pte); */
>
> I have not studied the low-level IA64 VM in detail, but don't you need
> some kind of TLB flush here?
>
> -Andi

Thank you,
Hirokazu Takahashi.
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  2:32 ` Hirokazu Takahashi
@ 2004-03-16  3:20   ` Hirokazu Takahashi
  0 siblings, 0 replies; 14+ messages in thread
From: Hirokazu Takahashi @ 2004-03-16  3:20 UTC (permalink / raw)
  To: ak; +Cc: n-yoshida, raybry, linux-kernel, lse-tech, linux-ia64

Hello,

> Yes, you are right. In the fault handler, we should use
> find_lock_page() instead of find_get_page() to find the hugepage
> associated with the faulting address.

Sorry, locking the page is not needed.

> After that, pte_none(*pte) should be checked again to see whether
> a race has happened.

While rechecking, mm->page_table_lock has to be held.
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  1:54 ` Andi Kleen
  2004-03-16  2:32   ` Hirokazu Takahashi
@ 2004-03-16  3:15   ` Nobuhiko Yoshida
  2004-04-01  9:10     ` Nobuhiko Yoshida
  1 sibling, 1 reply; 14+ messages in thread
From: Nobuhiko Yoshida @ 2004-03-16  3:15 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel; +Cc: raybry, lse-tech, linux-ia64, Hirokazu Takahashi

Hello,

> > +/*	update_mmu_cache(vma, address, *pte); */
>
> I have not studied the low-level IA64 VM in detail, but don't you need
> some kind of TLB flush here?

Oh! Yes. Perhaps a TLB flush is needed here.

Thank you,
Nobuhiko Yoshida
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  3:15 ` Nobuhiko Yoshida
@ 2004-04-01  9:10   ` Nobuhiko Yoshida
  0 siblings, 0 replies; 14+ messages in thread
From: Nobuhiko Yoshida @ 2004-04-01  9:10 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel
  Cc: raybry, lse-tech, linux-ia64, Hirokazu Takahashi, lhms-devel

Nobuhiko Yoshida <n-yoshida@pst.fujitsu.com> wrote:
> Hello,
>
> > > +/*	update_mmu_cache(vma, address, *pte); */
> >
> > I have not studied the low-level IA64 VM in detail, but don't you need
> > some kind of TLB flush here?
>
> Oh! Yes. Perhaps a TLB flush is needed here.

Below is a revised version of the patch I contributed before; I added
the flushes of the TLB and the icache.

How to use:
1. Download the linux-2.6.0 source tree.
2. Apply the patch below to linux-2.6.0.

Thank you,
Nobuhiko Yoshida

diff -dupr linux-2.6.0/arch/i386/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/i386/mm/hugetlbpage.c
--- linux-2.6.0/arch/i386/mm/hugetlbpage.c	2003-12-18 11:59:38.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/i386/mm/hugetlbpage.c	2004-04-01 11:48:56.000000000 +0900
@@ -142,8 +142,10 @@ int copy_hugetlb_page_range(struct mm_st
 		goto nomem;
 	src_pte = huge_pte_offset(src, addr);
 	entry = *src_pte;
-	ptepage = pte_page(entry);
-	get_page(ptepage);
+	if (!pte_none(entry)) {
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+	}
 	set_pte(dst_pte, entry);
 	dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 	addr += HPAGE_SIZE;
@@ -173,6 +175,11 @@ follow_hugetlb_page(struct mm_struct *mm

 		pte = huge_pte_offset(mm, vaddr);

+		if (!pte || pte_none(*pte)) {
+			hugetlb_fault(mm, vma, 0, vaddr);
+			pte = huge_pte_offset(mm, vaddr);
+		}
+
 		/* hugetlb should be locked, and hence, prefaulted */
 		WARN_ON(!pte || pte_none(*pte));

@@ -261,12 +268,17 @@ int pmd_huge(pmd_t pmd)
 }

 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	struct page *page;

 	page = pte_page(*(pte_t *)pmd);
+
+	if (!page) {
+		hugetlb_fault(mm, vma, write, address);
+		page = pte_page(*(pte_t *)pmd);
+	}
 	if (page) {
 		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
 		get_page(page);
@@ -527,6 +539,48 @@ int is_hugepage_mem_enough(size_t size)
 	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
 }

+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte;
+	int ret = VM_FAULT_MINOR;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	page = find_get_page(mapping, idx);
+
+	if (!page) {
+		page = alloc_hugetlb_page();
+		if (!page) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		unlock_page(page);
+		if (ret) {
+			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+	}
+	pte = huge_pte_alloc(mm, address);
+	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+/*	update_mmu_cache(vma, address, *pte); */
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
diff -dupr linux-2.6.0/arch/ia64/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.0/arch/ia64/mm/hugetlbpage.c	2003-12-18 11:58:56.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c	2004-03-22 11:29:01.000000000 +0900
@@ -170,8 +170,10 @@ int copy_hugetlb_page_range(struct mm_st
 		goto nomem;
 	src_pte = huge_pte_offset(src, addr);
 	entry = *src_pte;
-	ptepage = pte_page(entry);
-	get_page(ptepage);
+	if (!pte_none(entry)) {
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+	}
 	set_pte(dst_pte, entry);
 	dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 	addr += HPAGE_SIZE;
@@ -195,6 +197,12 @@ follow_hugetlb_page(struct mm_struct *mm
 	do {
 		pstart = start & HPAGE_MASK;
 		ptep = huge_pte_offset(mm, start);
+
+		if (!ptep || pte_none(*ptep)) {
+			hugetlb_fault(mm, vma, 0, start);
+			ptep = huge_pte_offset(mm, start);
+		}
+
 		pte = *ptep;

 back1:
@@ -236,6 +244,12 @@ struct page *follow_huge_addr(struct mm_
 	pte_t *ptep;

 	ptep = huge_pte_offset(mm, addr);
+
+	if (!ptep || pte_none(*ptep)) {
+		hugetlb_fault(mm, vma, 0, addr);
+		ptep = huge_pte_offset(mm, addr);
+	}
+
 	page = pte_page(*ptep);
 	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
 	get_page(page);
@@ -246,7 +260,8 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	return NULL;
 }
@@ -518,6 +533,49 @@ int is_hugepage_mem_enough(size_t size)
 	return 1;
 }

+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte;
+	int ret = VM_FAULT_MINOR;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	page = find_get_page(mapping, idx);
+
+	if (!page) {
+		page = alloc_hugetlb_page();
+		if (!page) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		unlock_page(page);
+		if (ret) {
+			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+	}
+	pte = huge_pte_alloc(mm, address);
+	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+	flush_tlb_range(vma, address, address + HPAGE_SIZE);
+	update_mmu_cache(vma, address, *pte);
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
 {
 	BUG();
diff -dupr linux-2.6.0/include/linux/hugetlb.h linux-2.6.0.HugeTLB/include/linux/hugetlb.h
--- linux-2.6.0/include/linux/hugetlb.h	2003-12-18 11:58:49.000000000 +0900
+++ linux-2.6.0.HugeTLB/include/linux/hugetlb.h	2003-12-19 09:47:25.000000000 +0900
@@ -23,10 +23,12 @@ struct page *follow_huge_addr(struct mm_
 			      unsigned long address, int write);
 struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
 				    unsigned long address);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write);
+struct page *follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *,
+			     unsigned long address, pmd_t *pmd, int write);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
+extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
+			 int, unsigned long);

 extern int htlbpage_max;

@@ -63,6 +65,7 @@ static inline int is_vm_hugetlb_page(str
 #define is_aligned_hugepage_range(addr, len)	0
 #define pmd_huge(x)	0
 #define is_hugepage_only_range(addr, len)	0
+#define hugetlb_fault(mm, vma, write, addr)	0

 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0	/* Keep the compiler happy */
diff -dupr linux-2.6.0/mm/memory.c linux-2.6.0.HugeTLB/mm/memory.c
--- linux-2.6.0/mm/memory.c	2003-12-18 11:58:48.000000000 +0900
+++ linux-2.6.0.HugeTLB/mm/memory.c	2003-12-19 09:47:46.000000000 +0900
@@ -640,7 +640,7 @@ follow_page(struct mm_struct *mm, unsign
 	if (pmd_none(*pmd))
 		goto out;
 	if (pmd_huge(*pmd))
-		return follow_huge_pmd(mm, address, pmd, write);
+		return follow_huge_pmd(mm, vma, address, pmd, write);
 	if (pmd_bad(*pmd))
 		goto out;

@@ -1603,7 +1603,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);

 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return hugetlb_fault(mm, vma, write_access, address);

 	/*
 	 * We need the page table lock to synchronize with kswapd
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
  ` (2 preceding siblings ...)
  2004-03-13  4:56 ` Hirokazu Takahashi
@ 2004-03-15 15:28 ` jlnance
  3 siblings, 0 replies; 14+ messages in thread
From: jlnance @ 2004-03-15 15:28 UTC (permalink / raw)
  To: linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory.

You know, when I started using Linux it wouldn't support more than 16M
of RAM. No one complained, because no one using Linux had a machine
with more than 16M of RAM. It looks like things have progressed a bit
since then :-)

Jim
end of thread, other threads: [~2004-04-01  9:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2004-03-13  3:45 Hugetlbpages in very large memory machines....... Ray Bryant
  -- strict thread matches above, loose matches on Subject: below --
2004-03-13  3:44 Ray Bryant
2004-03-13  3:48 ` Andi Kleen
2004-03-13  5:49   ` William Lee Irwin III
2004-03-14  2:45   ` Andrew Morton
2004-03-13  3:55 ` William Lee Irwin III
2004-03-13  4:56 ` Hirokazu Takahashi
2004-03-16  0:30   ` Nobuhiko Yoshida
2004-03-16  1:54     ` Andi Kleen
2004-03-16  2:32       ` Hirokazu Takahashi
2004-03-16  3:20         ` Hirokazu Takahashi
2004-03-16  3:15       ` Nobuhiko Yoshida
2004-04-01  9:10         ` Nobuhiko Yoshida
2004-03-15 15:28 ` jlnance