public inbox for linux-kernel@vger.kernel.org
* [UPDATE] [PATCH 0/3] Demand faulting for hugetlb
@ 2005-09-08 21:15 Adam Litke
  2005-09-08 21:18 ` [PATCH 1/3 htlb-get_user_pages] " Adam Litke
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Adam Litke @ 2005-09-08 21:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: agl

Sending this out again after incorporating the feedback from Yanmin
Zhang and Dave Hansen.  Are we ready to go for -mm yet?  The only patch
which has seen any recent changes is #2 below so I assume people agree
with #1 and #3 now :)

The three patches:
 1) Remove a get_user_pages() optimization that is no longer valid for
    demand faulting
 2) Move fault logic from hugetlb_prefault() to hugetlb_pte_fault()
 3) Apply a simple overcommit check so demand fault accounting behaves
    in a manner in line with how prefault worked

Diffed against 2.6.13-git6

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


* RE: [PATCH 2/3 htlb-fault] Demand faulting for hugetlb
@ 2005-09-07  3:13 Zhang, Yanmin
  0 siblings, 0 replies; 8+ messages in thread
From: Zhang, Yanmin @ 2005-09-07  3:13 UTC (permalink / raw)
  To: Adam Litke, linux-kernel

More comments below.

>>-----Original Message-----
>>From: linux-kernel-owner@vger.kernel.org
>>[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Adam Litke
>>Sent: Wednesday, September 07, 2005 5:59 AM
>>To: linux-kernel@vger.kernel.org
>>Cc: ADAM G. LITKE [imap]
>>Subject: Re: [PATCH 2/3 htlb-fault] Demand faulting for hugetlb
>>
>>Below is a patch to implement demand faulting for huge pages.  The main
>>motivation for changing from prefaulting to demand faulting is so that
>>huge page memory areas can be allocated according to NUMA policy.

>>@@ -277,18 +277,20 @@ int copy_hugetlb_page_range(struct mm_st
>> 	unsigned long addr = vma->vm_start;
>> 	unsigned long end = vma->vm_end;
>>
>>-	while (addr < end) {
>>+	for (; addr < end; addr += HPAGE_SIZE) {
>>+		src_pte = huge_pte_offset(src, addr);
>>+		if (!src_pte || pte_none(*src_pte))
>>+			continue;
>>+
>> 		dst_pte = huge_pte_alloc(dst, addr);
>> 		if (!dst_pte)
>> 			goto nomem;
>>-		src_pte = huge_pte_offset(src, addr);
>>-		BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */
>>+		BUG_ON(!src_pte);
Should this BUG_ON be deleted?

>> 		entry = *src_pte;
>> 		ptepage = pte_page(entry);
>> 		get_page(ptepage);
>> 		add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
>> 		set_huge_pte_at(dst, addr, dst_pte, entry);
>>-		addr += HPAGE_SIZE;
>> 	}
>> 	return 0;
>>
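The effect of the new loop is easiest to see in isolation. Below is a
userspace model of it (the types, macros, and names are stand-ins, not
the kernel's real ones): source entries that were never faulted in are
now skipped rather than asserted on, which is why the old "prefaulted"
BUG_ON had to go.

```c
/* Userspace model of the reworked copy loop: entries that were never
 * faulted in (pte_none) are skipped instead of tripping the old
 * "prefaulted" BUG_ON.  All names are illustrative stand-ins. */
#include <assert.h>
#include <stddef.h>

#define HPAGE_SIZE 0x200000UL   /* assumed 2 MB huge pages */

typedef unsigned long pte_t;
#define PTE_NONE 0UL

static int pte_none(pte_t pte) { return pte == PTE_NONE; }

/* src_ptes[i] models the source pte covering start + i * HPAGE_SIZE */
static int copy_range(const pte_t *src_ptes, pte_t *dst_ptes,
                      unsigned long start, unsigned long end)
{
    int copied = 0;
    unsigned long addr;

    for (addr = start; addr < end; addr += HPAGE_SIZE) {
        size_t i = (addr - start) / HPAGE_SIZE;

        if (pte_none(src_ptes[i]))
            continue;           /* not faulted in yet: nothing to copy */
        dst_ptes[i] = src_ptes[i];
        copied++;
    }
    return copied;
}
```

With demand faulting, holes in the source range are normal rather than
a bug, so the loop just moves on past them.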


>>+int hugetlb_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>+			unsigned long address, int write_access)
>>+{
>>+	int ret = VM_FAULT_MINOR;
>>+	unsigned long idx;
>>+	pte_t *pte;
>>+	struct page *page;
>>+	struct address_space *mapping;
>>+
>>+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
>>+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
>>+	BUG_ON(!vma->vm_file);
>>+
>>+	pte = huge_pte_alloc(mm, address);
Why call huge_pte_alloc() again? hugetlb_fault() already calls it.


>>+	if (!pte) {
>>+		ret = VM_FAULT_SIGBUS;
>>+		goto out;
>>+	}
>>+	if (! pte_none(*pte))
>>+		goto flush;
>>+
>>+	mapping = vma->vm_file->f_mapping;
>>+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
>>+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
>>+retry:
>>+	page = find_get_page(mapping, idx);
>>+	if (!page) {
>>+		/* charge the fs quota first */
>>+		if (hugetlb_get_quota(mapping)) {
>>+			ret = VM_FAULT_SIGBUS;
>>+			goto out;
>>+		}
>>+		page = alloc_huge_page();
>>+		if (!page) {
>>+			hugetlb_put_quota(mapping);
>>+			ret = VM_FAULT_SIGBUS;
>>+			goto out;
>>+		}
>>+		if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {
>>+			put_page(page);
>>+			goto retry;
>>+		}
>>+		unlock_page(page);
>>+	}
>>+	add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
>>+	set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
>>+flush:
>>+	flush_tlb_page(vma, address);
>>+out:
>>+	return ret;
>>+}
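The idx computation in the function above converts a faulting address
into a huge-page-sized page-cache index; note that vm_pgoff is kept in
small-page units, hence the extra shift. A minimal userspace sketch of
just that arithmetic (assuming x86-style 2 MB huge pages over 4 KB base
pages; the function name is mine, not the kernel's):

```c
/* Page-cache index of the huge page backing `address`, mirroring the
 * expression in hugetlb_pte_fault().  vm_pgoff is in PAGE_SIZE units,
 * so it is scaled down by (HPAGE_SHIFT - PAGE_SHIFT) to huge-page
 * units before being added. */
#include <assert.h>

#define PAGE_SHIFT  12          /* assumed 4 KB base pages */
#define HPAGE_SHIFT 21          /* assumed 2 MB huge pages */

static unsigned long hugepage_idx(unsigned long address,
                                  unsigned long vm_start,
                                  unsigned long vm_pgoff)
{
    return ((address - vm_start) >> HPAGE_SHIFT)
           + (vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
}
```

So a fault two huge pages into a mapping whose file offset starts one
huge page in (vm_pgoff = 512 small pages) lands on index 3.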
>>+
>>+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>+			unsigned long address, int write_access)
>>+{
>>+	pte_t *ptep;
>>+	int rc = VM_FAULT_MINOR;
>>+
>>+	spin_lock(&mm->page_table_lock);
>>+
>>+	ptep = huge_pte_alloc(mm, address & HPAGE_MASK);
The alignment is not needed. How about changing it to
ptep = huge_pte_alloc(mm, address)?

>>+	if (! ptep) {
>>+		rc = VM_FAULT_SIGBUS;
>>+		goto out;
>>+	}
>>+	if (pte_none(*ptep))
>>+		rc = hugetlb_pte_fault(mm, vma, address, write_access);
In hugetlb_pte_fault(), the TLB page is flushed when !pte_none(*pte),
but here it isn't. Why?

>>+out:
>>+	spin_unlock(&mm->page_table_lock);
>>+	return rc;
>>+}
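One way to address the duplicate huge_pte_alloc() noted above is to
pass the pte that hugetlb_fault() already allocated down into
hugetlb_pte_fault(). A stubbed userspace sketch of that restructuring
(all names and bodies are illustrative stand-ins, not the actual
kernel API):

```c
/* Sketch of the suggested restructuring: the caller allocates the pte
 * once and hands it down, so huge_pte_alloc() runs once per fault.
 * Everything here is a stub modeling control flow only. */
#include <assert.h>
#include <stddef.h>

typedef unsigned long pte_t;

static int alloc_calls;          /* counts huge_pte_alloc() invocations */
static pte_t pte_storage;        /* the one pte slot in this model */

static pte_t *huge_pte_alloc(void)
{
    alloc_calls++;
    return &pte_storage;
}

static int hugetlb_pte_fault(pte_t *pte)
{
    /* pte was allocated by the caller; just populate it */
    if (*pte == 0)
        *pte = 1;                /* stand-in for set_huge_pte_at() */
    return 0;
}

static int hugetlb_fault(void)
{
    pte_t *ptep = huge_pte_alloc();

    if (!ptep)
        return -1;               /* VM_FAULT_SIGBUS in the real code */
    if (*ptep == 0)
        return hugetlb_pte_fault(ptep);  /* no second allocation */
    return 0;
}
```

The fault path then does exactly one walk/allocation of the page
tables per fault instead of two.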

* RE: [PATCH 2/3 htlb-fault] Demand faulting for hugetlb
@ 2005-09-07  2:33 Zhang, Yanmin
  2005-09-07 14:04 ` Adam Litke
  0 siblings, 1 reply; 8+ messages in thread
From: Zhang, Yanmin @ 2005-09-07  2:33 UTC (permalink / raw)
  To: Adam Litke, linux-kernel

>>-----Original Message-----
>>From: linux-kernel-owner@vger.kernel.org
>>[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Adam Litke
>>Sent: Wednesday, September 07, 2005 5:59 AM
>>To: linux-kernel@vger.kernel.org
>>Cc: ADAM G. LITKE [imap]
>>Subject: Re: [PATCH 2/3 htlb-fault] Demand faulting for hugetlb

>>+retry:
>>+	page = find_get_page(mapping, idx);
>>+	if (!page) {
>>+		/* charge the fs quota first */
>>+		if (hugetlb_get_quota(mapping)) {
>>+			ret = VM_FAULT_SIGBUS;
>>+			goto out;
>>+		}
>>+		page = alloc_huge_page();
>>+		if (!page) {
>>+			hugetlb_put_quota(mapping);
>>+			ret = VM_FAULT_SIGBUS;
>>+			goto out;
>>+		}
>>+		if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {

Here you lost hugetlb_put_quota(mapping); the quota charged above is
never released when add_to_page_cache() fails.


>>+			put_page(page);
>>+			goto retry;
>>+		}
>>+		unlock_page(page);

For regular pages, the kernel unlocks mm->page_table_lock before
find_get_page() and relocks it before setting the pte. Why doesn't the
huge page fault path follow that convention?
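The quota leak called out above is easiest to see by modeling the
retry loop in userspace. In the sketch below (all functions are stubs
standing in for the kernel ones, and the race is forced once), losing
the add_to_page_cache() race releases the charge before retrying, so
the quota stays balanced:

```c
/* Userspace model of the fault path's quota handling, including the
 * fix: when add_to_page_cache() fails because another thread raced
 * the page in, the quota charged for our page is released before
 * retrying.  All functions are stubs. */
#include <assert.h>

static int quota;               /* outstanding fs-quota charges */
static int cache_has_page;      /* models a find_get_page() hit */
static int add_fail_once;       /* force one lost add_to_page_cache race */

static int hugetlb_get_quota(void) { quota++; return 0; }
static void hugetlb_put_quota(void) { quota--; }

static int find_get_page(void) { return cache_has_page; }

static int add_to_page_cache(void)
{
    if (add_fail_once) {        /* lost the race: page already added */
        add_fail_once = 0;
        cache_has_page = 1;
        return -1;
    }
    cache_has_page = 1;
    return 0;
}

static int fault(void)
{
retry:
    if (find_get_page())
        return 0;               /* page already in the cache */
    if (hugetlb_get_quota())
        return -1;
    /* alloc_huge_page() elided: assume it succeeds */
    if (add_to_page_cache()) {
        hugetlb_put_quota();    /* the missing put_quota on the race */
        goto retry;
    }
    return 0;
}
```

Without the put_quota in the race branch, every lost race would leave
one quota charge outstanding forever.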


* [PATCH 0/3] Demand faulting for hugetlb
@ 2005-09-06 21:53 Adam Litke
  2005-09-06 21:58 ` [PATCH 2/3 htlb-fault] " Adam Litke
  0 siblings, 1 reply; 8+ messages in thread
From: Adam Litke @ 2005-09-06 21:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: ADAM G. LITKE [imap]

I am sending out the latest set of patches for hugetlb demand faulting.
I've incorporated all the feedback I've received from previous
discussions and I think this is ready for some more widespread testing.
Is anyone opposed to spinning this in -mm as it stands?

The three patches:
 1) Remove a get_user_pages() optimization that is no longer valid for
    demand faulting
 2) Move fault logic from hugetlb_prefault() to hugetlb_pte_fault()
 3) Apply a simple overcommit check so demand fault accounting behaves
    in a manner in line with how prefault worked

Diffed against 2.6.13-git6




Thread overview: 8+ messages
2005-09-08 21:15 [UPDATE] [PATCH 0/3] Demand faulting for hugetlb Adam Litke
2005-09-08 21:18 ` [PATCH 1/3 htlb-get_user_pages] " Adam Litke
2005-09-08 21:20 ` [PATCH 2/3 htlb-fault] " Adam Litke
2005-09-08 21:21 ` [PATCH 3/3 htlb-acct] " Adam Litke
  -- strict thread matches above, loose matches on Subject: below --
2005-09-07  3:13 [PATCH 2/3 htlb-fault] " Zhang, Yanmin
2005-09-07  2:33 Zhang, Yanmin
2005-09-07 14:04 ` Adam Litke
2005-09-06 21:53 [PATCH 0/3] " Adam Litke
2005-09-06 21:58 ` [PATCH 2/3 htlb-fault] " Adam Litke
