* Hugetlbpages in very large memory machines.......
@ 2004-03-13 3:44 Ray Bryant
2004-03-13 3:48 ` Andi Kleen
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Ray Bryant @ 2004-03-13 3:44 UTC (permalink / raw)
To: lse-tech, linux-ia64@vger.kernel.org, linux-kernel
We've run into a scaling problem using hugetlb pages on very large memory machines, e.g. machines
with 1 TB or more of main memory. The problem is that hugetlb pages are not faulted in; instead they
are zeroed and mapped in by hugetlb_prefault() (at least on ia64), which is called in response to
the user's mmap() request. The net result is that all of the hugetlb pages end up being allocated
and zeroed by a single thread, and if most of the machine's memory is allocated to hugetlb pages
and there is 1 TB or more of main memory, zeroing and allocating all of those pages can take a long
time (500 s or more).
We've looked at allocating and zeroing hugetlb pages at fault time, which would at least allow
multiple processors to be thrown at the problem. The question is: has anyone else been working on
this problem, and might they have prototype code they could share with us?
Thanks,
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
@ 2004-03-13  3:48 ` Andi Kleen
  2004-03-13  5:49   ` William Lee Irwin III
  2004-03-14  2:45   ` Andrew Morton
  2004-03-13  3:55 ` William Lee Irwin III
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2004-03-13  3:48 UTC (permalink / raw)
  To: Ray Bryant; +Cc: lse-tech, linux-ia64@vger.kernel.org, linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory. The
> problem is that hugetlb pages are not faulted in; instead they are
> zeroed and mapped in by hugetlb_prefault() (at least on ia64), which
> is called in response to the user's mmap() request. The net result is
> that all of the hugetlb pages end up being allocated and zeroed by a
> single thread, and if most of the machine's memory is allocated to
> hugetlb pages and there is 1 TB or more of main memory, zeroing and
> allocating all of those pages can take a long time (500 s or more).
>
> We've looked at allocating and zeroing hugetlb pages at fault time,
> which would at least allow multiple processors to be thrown at the
> problem. The question is: has anyone else been working on this
> problem, and might they have prototype code they could share with us?

Yes. I ran into exactly this problem with the NUMA API too. mbind()
runs after mmap(), but it cannot work any more when the pages are
already allocated.

I fixed it on x86-64/i386 by allocating the pages lazily. Doing it for
IA64 has been on the todo list too.

The i386/x86-64 code is attached as an example.

One drawback is that the out-of-memory handling is a lot less nice
than it was before - when you run out of hugepages you now get SIGBUS
instead of ENOMEM from mmap(). Maybe some prereservation would make
sense, but that would be somewhat harder. Alternatively, fall back to
smaller pages if possible (I was told that isn't easily possible on
IA64).

-Andi

diff -burpN -X ../KDIFX linux-2.6.2/arch/i386/mm/hugetlbpage.c linux-2.6.2-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.2/arch/i386/mm/hugetlbpage.c	2004-02-24 20:48:10.000000000 +0100
+++ linux-2.6.2-numa/arch/i386/mm/hugetlbpage.c	2004-02-20 18:52:57.000000000 +0100
@@ -329,41 +333,43 @@ zap_hugepage_range(struct vm_area_struct
 	spin_unlock(&mm->page_table_lock);
 }

-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+/* page_table_lock held on entry. */
+static int
+hugetlb_alloc_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		    unsigned long addr, int write_access)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 	unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
+	int ret;
+	pte_t *pte;
+	struct page *page = NULL;
+	struct address_space *mapping = vma->vm_file->f_mapping;

+	pte = huge_pte_alloc(mm, addr);
 	if (!pte) {
-		ret = -ENOMEM;
+		ret = VM_FAULT_OOM;
 		goto out;
 	}

-		if (!pte_none(*pte))
-			continue;
-
 	idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
 		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
 	page = find_get_page(mapping, idx);
 	if (!page) {
-		/* charge the fs quota first */
-		if (hugetlb_get_quota(mapping)) {
-			ret = -ENOMEM;
+		/* Should do this at prefault time, but that gets us into
+		   trouble with freeing right now. */
+		ret = hugetlb_get_quota(mapping);
+		if (ret) {
+			ret = VM_FAULT_OOM;
 			goto out;
 		}
-		page = alloc_hugetlb_page();
+
+		page = alloc_hugetlb_page(vma);
 		if (!page) {
 			hugetlb_put_quota(mapping);
-			ret = -ENOMEM;
+
+			/* Instead of OOMing here could just transparently use
+			   small pages. */
+
+			ret = VM_FAULT_OOM;
 			goto out;
 		}
 		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
@@ -371,23 +377,62 @@ int hugetlb_prefault(struct address_spac
 		if (ret) {
 			hugetlb_put_quota(mapping);
 			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-	}
+		ret = VM_FAULT_MAJOR;
+	} else
+		ret = VM_FAULT_MINOR;
+
 	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
+
+	/* Don't need to flush other CPUs. They will just do a page
+	   fault and flush it lazily. */
+	__flush_tlb_one(addr);
+
+ out:
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }

+int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long address, int write_access)
+{
+	pmd_t *pmd;
+	pgd_t *pgd;
+
+	if (write_access && !(vma->vm_flags & VM_WRITE))
+		return VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access);
+
+	pmd = pmd_offset(pgd, address);
+	if (pmd_none(*pmd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access);
+
+	BUG_ON(!pmd_large(*pmd));
+
+	/* must have been a race. Flush the TLB. NX not supported yet. */
+
+	__flush_tlb_one(address);
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR;
+}
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 static void update_and_free_page(struct page *page)
 {
 	int j;
 	struct page *map;

 	map = page;
 	htlbzone_pages--;
 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
 		map->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
diff -burpN -X ../KDIFX linux-2.6.2/mm/memory.c linux-2.6.2-numa/mm/memory.c
--- linux-2.6.2/mm/memory.c	2004-02-20 18:31:32.000000000 +0100
+++ linux-2.6.2-numa/mm/memory.c	2004-02-18 20:08:40.000000000 +0100
@@ -1576,6 +1593,15 @@ static inline int handle_pte_fault(struc
 	return VM_FAULT_MINOR;
 }

+
+/* Can be overridden by the architecture */
+int __attribute__((weak)) arch_hugetlb_fault(struct mm_struct *mm,
+					     struct vm_area_struct *vma,
+					     unsigned long address, int write_access)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  */
@@ -1591,7 +1617,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);

 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return arch_hugetlb_fault(mm, vma, address, write_access);

 	/*
 	 * We need the page table lock to synchronize with kswapd
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:48 ` Andi Kleen
@ 2004-03-13  5:49   ` William Lee Irwin III
  2004-03-14  2:45   ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2004-03-13  5:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Bryant, lse-tech, linux-ia64@vger.kernel.org, linux-kernel

On Sat, Mar 13, 2004 at 04:48:40AM +0100, Andi Kleen wrote:
> One drawback is that the out-of-memory handling is a lot less nice
> than it was before - when you run out of hugepages you now get SIGBUS
> instead of ENOMEM from mmap(). Maybe some prereservation would make
> sense, but that would be somewhat harder. Alternatively, fall back to
> smaller pages if possible (I was told that isn't easily possible on
> IA64).

That's not entirely true; whether it's feasible depends on how the MMU
is used. The HPW (hardware pagetable walker) and the short format of
the VHPT insist upon page size being a per-region attribute, where
regions are something like 60-bit areas of virtual space - that is
likely what they were referring to. The VHPT in long format should be
capable of arbitrary virtual placement (modulo alignment, of course).

-- wli
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:48 ` Andi Kleen
  2004-03-13  5:49   ` William Lee Irwin III
@ 2004-03-14  2:45   ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2004-03-14  2:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: raybry, lse-tech, linux-ia64, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> > We've looked at allocating and zeroing hugetlb pages at fault time,
> > which would at least allow multiple processors to be thrown at the
> > problem. The question is: has anyone else been working on this
> > problem, and might they have prototype code they could share with us?
>
> Yes. I ran into exactly this problem with the NUMA API too. mbind()
> runs after mmap(), but it cannot work any more when the pages are
> already allocated.
>
> I fixed it on x86-64/i386 by allocating the pages lazily. Doing it
> for IA64 has been on the todo list too.
>
> The i386/x86-64 code is attached as an example.
>
> One drawback is that the out-of-memory handling is a lot less nice
> than it was before - when you run out of hugepages you now get SIGBUS
> instead of ENOMEM from mmap(). Maybe some prereservation would make
> sense, but that would be somewhat harder. Alternatively, fall back to
> smaller pages if possible (I was told that isn't easily possible on
> IA64).

Demand-paging the hugepages is a decent feature to have, and ISTR
resisting it before for this reason.

Even though it's early in the 2.6 series, I'd be a bit worried about
breaking existing hugetlb users in this way. Yes, the pages are
preallocated, so it is unlikely that a working setup will suddenly
break - unless someone is using the return value from mmap() to find
out how many pages they can get.

So ho-hum. I think it needs to be back-compatible. Could we add
MAP_NO_PREFAULT?
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
  2004-03-13  3:48 ` Andi Kleen
@ 2004-03-13  3:55 ` William Lee Irwin III
  2004-03-13  4:56 ` Hirokazu Takahashi
  2004-03-15 15:28 ` jlnance
  3 siblings, 0 replies; 14+ messages in thread
From: William Lee Irwin III @ 2004-03-13  3:55 UTC (permalink / raw)
  To: Ray Bryant; +Cc: lse-tech, linux-ia64@vger.kernel.org, linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory. The
> problem is that hugetlb pages are not faulted in; instead they are
> zeroed and mapped in by hugetlb_prefault() (at least on ia64), which
> is called in response to the user's mmap() request. The net result is
> that all of the hugetlb pages end up being allocated and zeroed by a
> single thread, and if most of the machine's memory is allocated to
> hugetlb pages and there is 1 TB or more of main memory, zeroing and
> allocating all of those pages can take a long time (500 s or more).
> We've looked at allocating and zeroing hugetlb pages at fault time,
> which would at least allow multiple processors to be thrown at the
> problem. The question is: has anyone else been working on this
> problem, and might they have prototype code they could share with us?

This actually is largely a question of architecture-dependent code, so
the answer will depend on whether your architecture matches those of
the others who have had a need to arrange this. Basically, all you
really need to do is check the vma and call either a hugetlb-specific
fault handler or handle_mm_fault(), depending on whether hugetlb is
configured. Once you've gotten that far, it's only a question of
implementing the methods to work together properly when driven by the
upper layers.

The reason why this wasn't done up front was that there wasn't a
demonstrable need to do so. The issue you're citing is exactly the
kind of demonstration needed to motivate its inclusion.

-- wli
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
  2004-03-13  3:48 ` Andi Kleen
  2004-03-13  3:55 ` William Lee Irwin III
@ 2004-03-13  4:56 ` Hirokazu Takahashi
  2004-03-16  0:30   ` Nobuhiko Yoshida
  2004-03-15 15:28 ` jlnance
  3 siblings, 1 reply; 14+ messages in thread
From: Hirokazu Takahashi @ 2004-03-13  4:56 UTC (permalink / raw)
  To: raybry; +Cc: lse-tech, linux-ia64, linux-kernel, n-yoshida

Hello,

My following patch might help you. It includes a page-fault routine
for hugetlb pages. If you want to use it for your purpose, you need to
remove some code from hugetlb_prefault() so that the fault path calls
hugetlb_fault() instead.

  http://people.valinux.co.jp/~taka/patches/va01-hugepagefault.patch

But it's just for IA32. I heard that n-yoshida@pst.fujitsu.com was
porting this patch to IA64.

> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory. The
> problem is that hugetlb pages are not faulted in; instead they are
> zeroed and mapped in by hugetlb_prefault() (at least on ia64), which
> is called in response to the user's mmap() request. The net result is
> that all of the hugetlb pages end up being allocated and zeroed by a
> single thread, and if most of the machine's memory is allocated to
> hugetlb pages and there is 1 TB or more of main memory, zeroing and
> allocating all of those pages can take a long time (500 s or more).
>
> We've looked at allocating and zeroing hugetlb pages at fault time,
> which would at least allow multiple processors to be thrown at the
> problem. The question is: has anyone else been working on this
> problem, and might they have prototype code they could share with us?
>
> Thanks,
> --
> Best Regards,
> Ray

Thank you,
Hirokazu Takahashi.
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  4:56 ` Hirokazu Takahashi
@ 2004-03-16  0:30   ` Nobuhiko Yoshida
  2004-03-16  1:54     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Nobuhiko Yoshida @ 2004-03-16  0:30 UTC (permalink / raw)
  To: raybry, linux-kernel; +Cc: lse-tech, linux-ia64, Hirokazu Takahashi

Hello,

Hirokazu Takahashi <taka@valinux.co.jp> wrote:
> Hello,
>
> My following patch might help you. It includes a page-fault routine
> for hugetlb pages. If you want to use it for your purpose, you need to
> remove some code from hugetlb_prefault() so that the fault path calls
> hugetlb_fault() instead.
>
>   http://people.valinux.co.jp/~taka/patches/va01-hugepagefault.patch
>
> But it's just for IA32.
>
> I heard that n-yoshida@pst.fujitsu.com was porting this patch to IA64.

Below is my port of Takahashi-san's patch to IA64. However, my patch is
for kernel 2.6.0 and cannot be applied to 2.6.1 or later.

Thank you,
Nobuhiko Yoshida

diff -dupr linux-2.6.0.org/arch/ia64/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.0.org/arch/ia64/mm/hugetlbpage.c	2003-12-18 11:58:56.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c	2004-01-06 14:26:53.000000000 +0900
@@ -170,8 +170,10 @@ int copy_hugetlb_page_range(struct mm_st
 		goto nomem;
 	src_pte = huge_pte_offset(src, addr);
 	entry = *src_pte;
-	ptepage = pte_page(entry);
-	get_page(ptepage);
+	if (!pte_none(entry)) {
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+	}
 	set_pte(dst_pte, entry);
 	dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 	addr += HPAGE_SIZE;
@@ -195,6 +197,12 @@ follow_hugetlb_page(struct mm_struct *mm
 	do {
 		pstart = start & HPAGE_MASK;
 		ptep = huge_pte_offset(mm, start);
+
+		if (!ptep || pte_none(*ptep)) {
+			hugetlb_fault(mm, vma, 0, start);
+			ptep = huge_pte_offset(mm, start);
+		}
+
 		pte = *ptep;

 back1:
@@ -236,6 +244,12 @@ struct page *follow_huge_addr(struct mm_
 	pte_t *ptep;

 	ptep = huge_pte_offset(mm, addr);
+
+	if (!ptep || pte_none(*ptep)) {
+		hugetlb_fault(mm, vma, 0, addr);
+		ptep = huge_pte_offset(mm, addr);
+	}
+
 	page = pte_page(*ptep);
 	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
 	get_page(page);
@@ -246,7 +260,8 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	return NULL;
 }
@@ -518,6 +533,48 @@ int is_hugepage_mem_enough(size_t size)
 	return 1;
 }

+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte;
+	int ret = VM_FAULT_MINOR;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	page = find_get_page(mapping, idx);
+
+	if (!page) {
+		page = alloc_hugetlb_page();
+		if (!page) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		unlock_page(page);
+		if (ret) {
+			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+	}
+	pte = huge_pte_alloc(mm, address);
+	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+/*	update_mmu_cache(vma, address, *pte); */
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
 {
 	BUG();
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  0:30 ` Nobuhiko Yoshida
@ 2004-03-16  1:54   ` Andi Kleen
  2004-03-16  2:32     ` Hirokazu Takahashi
  2004-03-16  3:15     ` Nobuhiko Yoshida
  0 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2004-03-16  1:54 UTC (permalink / raw)
  To: Nobuhiko Yoshida
  Cc: raybry, linux-kernel, lse-tech, linux-ia64, Hirokazu Takahashi

> +	pte = huge_pte_alloc(mm, address);
> +	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);

This looks broken. Another CPU could have raced to the same fault and
already added a PTE here. You have to handle that.

(my i386 version originally had the same problem)

> +/*	update_mmu_cache(vma, address, *pte); */

I have not studied the low-level IA64 VM in detail, but don't you need
some kind of TLB flush here?

-Andi
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  1:54 ` Andi Kleen
@ 2004-03-16  2:32   ` Hirokazu Takahashi
  2004-03-16  3:20     ` Hirokazu Takahashi
  1 sibling, 1 reply; 14+ messages in thread
From: Hirokazu Takahashi @ 2004-03-16  2:32 UTC (permalink / raw)
  To: ak; +Cc: n-yoshida, raybry, linux-kernel, lse-tech, linux-ia64

Hello,

> > +	pte = huge_pte_alloc(mm, address);
> > +	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
>
> This looks broken. Another CPU could have raced to the same fault and
> already added a PTE here. You have to handle that.
>
> (my i386 version originally had the same problem)

Yes, you are right. In the fault handler, we should use
find_lock_page() instead of find_get_page() to find the hugepage
associated with the faulting address. After that, pte_none(*pte)
should be checked again to see whether a race has happened.

> > +/*	update_mmu_cache(vma, address, *pte); */
>
> I have not studied the low-level IA64 VM in detail, but don't you need
> some kind of TLB flush here?
>
> -Andi

Thank you,
Hirokazu Takahashi.
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  2:32 ` Hirokazu Takahashi
@ 2004-03-16  3:20   ` Hirokazu Takahashi
  0 siblings, 0 replies; 14+ messages in thread
From: Hirokazu Takahashi @ 2004-03-16  3:20 UTC (permalink / raw)
  To: ak; +Cc: n-yoshida, raybry, linux-kernel, lse-tech, linux-ia64

Hello,

> Yes, you are right. In the fault handler, we should use
> find_lock_page() instead of find_get_page() to find the hugepage
> associated with the faulting address.

Sorry, locking the page is not needed.

> After that, pte_none(*pte) should be checked again to see whether
> a race has happened.

While rechecking, mm->page_table_lock has to be held.
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  1:54 ` Andi Kleen
  2004-03-16  2:32   ` Hirokazu Takahashi
@ 2004-03-16  3:15   ` Nobuhiko Yoshida
  2004-04-01  9:10     ` Nobuhiko Yoshida
  1 sibling, 1 reply; 14+ messages in thread
From: Nobuhiko Yoshida @ 2004-03-16  3:15 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel; +Cc: raybry, lse-tech, linux-ia64, Hirokazu Takahashi

Hello,

> > +/*	update_mmu_cache(vma, address, *pte); */
>
> I have not studied the low-level IA64 VM in detail, but don't you need
> some kind of TLB flush here?

Oh! Yes. Perhaps a TLB flush is needed here.

Thank you,
Nobuhiko Yoshida
* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  3:15 ` Nobuhiko Yoshida
@ 2004-04-01  9:10   ` Nobuhiko Yoshida
  0 siblings, 0 replies; 14+ messages in thread
From: Nobuhiko Yoshida @ 2004-04-01  9:10 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel
  Cc: raybry, lse-tech, linux-ia64, Hirokazu Takahashi, lhms-devel

Nobuhiko Yoshida <n-yoshida@pst.fujitsu.com> wrote:
> Hello,
>
> > > +/*	update_mmu_cache(vma, address, *pte); */
> >
> > I have not studied the low-level IA64 VM in detail, but don't you need
> > some kind of TLB flush here?
>
> Oh! Yes. Perhaps a TLB flush is needed here.

Below is a revised version of the patch I contributed before; I added
the flushes of the TLB and the icache.

How to use:
1. Download the linux-2.6.0 source tree.
2. Apply the patch below to linux-2.6.0.

Thank you,
Nobuhiko Yoshida

diff -dupr linux-2.6.0/arch/i386/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/i386/mm/hugetlbpage.c
--- linux-2.6.0/arch/i386/mm/hugetlbpage.c	2003-12-18 11:59:38.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/i386/mm/hugetlbpage.c	2004-04-01 11:48:56.000000000 +0900
@@ -142,8 +142,10 @@ int copy_hugetlb_page_range(struct mm_st
 		goto nomem;
 	src_pte = huge_pte_offset(src, addr);
 	entry = *src_pte;
-	ptepage = pte_page(entry);
-	get_page(ptepage);
+	if (!pte_none(entry)) {
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+	}
 	set_pte(dst_pte, entry);
 	dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 	addr += HPAGE_SIZE;
@@ -173,6 +175,11 @@ follow_hugetlb_page(struct mm_struct *mm

 		pte = huge_pte_offset(mm, vaddr);

+		if (!pte || pte_none(*pte)) {
+			hugetlb_fault(mm, vma, 0, vaddr);
+			pte = huge_pte_offset(mm, vaddr);
+		}
+
 		/* hugetlb should be locked, and hence, prefaulted */
 		WARN_ON(!pte || pte_none(*pte));

@@ -261,12 +268,17 @@ int pmd_huge(pmd_t pmd)
 }

 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	struct page *page;

 	page = pte_page(*(pte_t *)pmd);
+
+	if (!page) {
+		hugetlb_fault(mm, vma, write, address);
+		page = pte_page(*(pte_t *)pmd);
+	}
 	if (page) {
 		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
 		get_page(page);
@@ -527,6 +539,48 @@ int is_hugepage_mem_enough(size_t size)
 	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
 }

+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte;
+	int ret = VM_FAULT_MINOR;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	page = find_get_page(mapping, idx);
+
+	if (!page) {
+		page = alloc_hugetlb_page();
+		if (!page) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		unlock_page(page);
+		if (ret) {
+			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+	}
+	pte = huge_pte_alloc(mm, address);
+	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+/*	update_mmu_cache(vma, address, *pte); */
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
diff -dupr linux-2.6.0/arch/ia64/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.0/arch/ia64/mm/hugetlbpage.c	2003-12-18 11:58:56.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c	2004-03-22 11:29:01.000000000 +0900
@@ -170,8 +170,10 @@ int copy_hugetlb_page_range(struct mm_st
 		goto nomem;
 	src_pte = huge_pte_offset(src, addr);
 	entry = *src_pte;
-	ptepage = pte_page(entry);
-	get_page(ptepage);
+	if (!pte_none(entry)) {
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+	}
 	set_pte(dst_pte, entry);
 	dst->rss += (HPAGE_SIZE / PAGE_SIZE);
 	addr += HPAGE_SIZE;
@@ -195,6 +197,12 @@ follow_hugetlb_page(struct mm_struct *mm
 	do {
 		pstart = start & HPAGE_MASK;
 		ptep = huge_pte_offset(mm, start);
+
+		if (!ptep || pte_none(*ptep)) {
+			hugetlb_fault(mm, vma, 0, start);
+			ptep = huge_pte_offset(mm, start);
+		}
+
 		pte = *ptep;

 back1:
@@ -236,6 +244,12 @@ struct page *follow_huge_addr(struct mm_
 	pte_t *ptep;

 	ptep = huge_pte_offset(mm, addr);
+
+	if (!ptep || pte_none(*ptep)) {
+		hugetlb_fault(mm, vma, 0, addr);
+		ptep = huge_pte_offset(mm, addr);
+	}
+
 	page = pte_page(*ptep);
 	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
 	get_page(page);
@@ -246,7 +260,8 @@ int pmd_huge(pmd_t pmd)
 	return 0;
 }
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, int write)
 {
 	return NULL;
 }
@@ -518,6 +533,49 @@ int is_hugepage_mem_enough(size_t size)
 	return 1;
 }

+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+	struct page *page;
+	unsigned long idx;
+	pte_t *pte;
+	int ret = VM_FAULT_MINOR;
+
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+	page = find_get_page(mapping, idx);
+
+	if (!page) {
+		page = alloc_hugetlb_page();
+		if (!page) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+		unlock_page(page);
+		if (ret) {
+			free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+	}
+	pte = huge_pte_alloc(mm, address);
+	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+	flush_tlb_range(vma, address, address + HPAGE_SIZE);
+	update_mmu_cache(vma, address, *pte);
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
 {
 	BUG();
diff -dupr linux-2.6.0/include/linux/hugetlb.h linux-2.6.0.HugeTLB/include/linux/hugetlb.h
--- linux-2.6.0/include/linux/hugetlb.h	2003-12-18 11:58:49.000000000 +0900
+++ linux-2.6.0.HugeTLB/include/linux/hugetlb.h	2003-12-19 09:47:25.000000000 +0900
@@ -23,10 +23,12 @@ struct page *follow_huge_addr(struct mm_
 			      unsigned long address, int write);
 struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
 				    unsigned long address);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write);
+struct page *follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *,
+			     unsigned long address, pmd_t *pmd, int write);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
+extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
+			 int, unsigned long);

 extern int htlbpage_max;

@@ -63,6 +65,7 @@ static inline int is_vm_hugetlb_page(str
 #define is_aligned_hugepage_range(addr, len)	0
 #define pmd_huge(x)	0
 #define is_hugepage_only_range(addr, len)	0
+#define hugetlb_fault(mm, vma, write, addr)	0

 #ifndef HPAGE_MASK
 #define HPAGE_MASK	0	/* Keep the compiler happy */
diff -dupr linux-2.6.0/mm/memory.c linux-2.6.0.HugeTLB/mm/memory.c
--- linux-2.6.0/mm/memory.c	2003-12-18 11:58:48.000000000 +0900
+++ linux-2.6.0.HugeTLB/mm/memory.c	2003-12-19 09:47:46.000000000 +0900
@@ -640,7 +640,7 @@ follow_page(struct mm_struct *mm, unsign
 	if (pmd_none(*pmd))
 		goto out;
 	if (pmd_huge(*pmd))
-		return follow_huge_pmd(mm, address, pmd, write);
+		return follow_huge_pmd(mm, vma, address, pmd, write);
 	if (pmd_bad(*pmd))
 		goto out;

@@ -1603,7 +1603,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);

 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return hugetlb_fault(mm, vma, write_access, address);

 	/*
 	 * We need the page table lock to synchronize with kswapd
* Re: Hugetlbpages in very large memory machines.......
  2004-03-13  3:44 Ray Bryant
  ` (2 preceding siblings ...)
  2004-03-13  4:56 ` Hirokazu Takahashi
@ 2004-03-15 15:28 ` jlnance
  3 siblings, 0 replies; 14+ messages in thread
From: jlnance @ 2004-03-15 15:28 UTC (permalink / raw)
  To: linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlb pages on very large
> memory machines, e.g. machines with 1 TB or more of main memory.

You know, when I started using Linux it wouldn't support more than 16M
of RAM. No one complained, because no one using Linux had a machine
with more than 16M of RAM. It looks like things have progressed a bit
since then :-)

Jim
end of thread, other threads: [~2004-04-01  9:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2004-03-13  3:45 Hugetlbpages in very large memory machines....... Ray Bryant
  -- strict thread matches above, loose matches on Subject: below --
2004-03-13  3:44 Ray Bryant
2004-03-13  3:48 ` Andi Kleen
2004-03-13  5:49   ` William Lee Irwin III
2004-03-14  2:45   ` Andrew Morton
2004-03-13  3:55 ` William Lee Irwin III
2004-03-13  4:56 ` Hirokazu Takahashi
2004-03-16  0:30   ` Nobuhiko Yoshida
2004-03-16  1:54     ` Andi Kleen
2004-03-16  2:32       ` Hirokazu Takahashi
2004-03-16  3:20         ` Hirokazu Takahashi
2004-03-16  3:15       ` Nobuhiko Yoshida
2004-04-01  9:10         ` Nobuhiko Yoshida
2004-03-15 15:28 ` jlnance