* [PATCH 0/4] hugetlb: copy on write
From: Adam Litke @ 2005-11-09 23:28 UTC
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
    kenneth.w.chen, ADAM G. LITKE [imap]

This is a resend of the patches I sent on Nov 7th.  I've broken them out
as requested.  Comments (especially on the copy-on-write portion)
appreciated.  Does anyone have a fundamental objection to moving forward
with these?

remove-dup-isize-check     - Remove duplicated i_size truncation race check
rename-find_lock_huge_page - Switch to a more appropriate name
hugetlb_no_page            - Mild reorg to support multiple fault types
htlb-cow                   - Copy on write support

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

* [PATCH 1/4] Hugetlb: Remove duplicate i_size check
From: Adam Litke @ 2005-11-09 23:36 UTC
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
    kenneth.w.chen, ADAM G. LITKE [imap]

Hugetlb: Remove duplicate i_size check

On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
> - The check against i_size was duplicated: once in
>   find_lock_huge_page() and again in hugetlb_fault() after taking the
>   page_table_lock.  We only really need the locked one, so remove the
>   other.

Original post by David Gibson <david@gibson.dropbear.id.au>

Version 2: Wed 9 Nov 2005
	Split this cleanup out into a standalone patch

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlb.c |    7 -------
 1 files changed, 7 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -344,19 +344,12 @@ static struct page *find_lock_huge_page(
 {
 	struct page *page;
 	int err;
-	struct inode *inode = mapping->host;
-	unsigned long size;
 
 retry:
 	page = find_lock_page(mapping, idx);
 	if (page)
 		goto out;
 
-	/* Check to make sure the mapping hasn't been truncated */
-	size = i_size_read(inode) >> HPAGE_SHIFT;
-	if (idx >= size)
-		goto out;
-
 	if (hugetlb_get_quota(mapping))
 		goto out;
 	page = alloc_huge_page();

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

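Editorial sketch of the truncation check discussed above: the test that remains after this patch (the one taken under page_table_lock) compares the huge page index against the file size expressed in huge pages. The following is a userspace illustration only, not kernel code. It assumes 2 MiB huge pages (HPAGE_SHIFT of 21; the real value is architecture dependent), and the helper names huge_pages_in_file() and idx_beyond_eof() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MiB huge pages; the kernel's HPAGE_SHIFT is per-architecture. */
#define HPAGE_SHIFT 21

/* Hypothetical helper: number of whole huge pages covered by i_size bytes.
 * Mirrors "size = i_size_read(inode) >> HPAGE_SHIFT" from the patch. */
static uint64_t huge_pages_in_file(uint64_t i_size)
{
	return i_size >> HPAGE_SHIFT;
}

/* Hypothetical helper: mirrors "if (idx >= size)", i.e. the faulting
 * index lies at or beyond the (possibly truncated) end of the file. */
static int idx_beyond_eof(uint64_t idx, uint64_t i_size)
{
	return idx >= huge_pages_in_file(i_size);
}
```

Note that the shift rounds down, so a file whose size is not a multiple of the huge page size has no valid index for its partial tail.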
* Re: [PATCH 1/4] Hugetlb: Remove duplicate i_size check
From: William Lee Irwin III @ 2005-11-10 0:10 UTC
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
    kenneth.w.chen

On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
>> - The check against i_size was duplicated: once in
>>   find_lock_huge_page() and again in hugetlb_fault() after taking the
>>   page_table_lock.  We only really need the locked one, so remove the
>>   other.

On Wed, Nov 09, 2005 at 05:36:49PM -0600, Adam Litke wrote:
> Original post by David Gibson <david@gibson.dropbear.id.au>
> Version 2: Wed 9 Nov 2005
> 	Split this cleanup out into a standalone patch
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Adam Litke <agl@us.ibm.com>

Innocuous enough.

Acked-by: William Irwin <wli@holomorphy.com>

* [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
From: Adam Litke @ 2005-11-09 23:37 UTC
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
    kenneth.w.chen, ADAM G. LITKE [imap]

Hugetlb: Rename find_lock_page to find_or_alloc_huge_page

On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
- find_lock_huge_page() isn't a great name, since it does extra things
  not analogous to find_lock_page().  Rename it
  find_or_alloc_huge_page() which is closer to the mark.

Original post by David Gibson <david@gibson.dropbear.id.au>

Version 2: Wed 9 Nov 2005
	Split into a separate patch

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlb.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -339,8 +339,8 @@ void unmap_hugepage_range(struct vm_area
 	flush_tlb_range(vma, start, end);
 }
 
-static struct page *find_lock_huge_page(struct address_space *mapping,
-			unsigned long idx)
+static struct page *find_or_alloc_huge_page(struct address_space *mapping,
+						unsigned long idx)
 {
 	struct page *page;
 	int err;
@@ -392,7 +392,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
 	 */
-	page = find_lock_huge_page(mapping, idx);
+	page = find_or_alloc_huge_page(mapping, idx);
 	if (!page)
 		goto out;
 

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

* Re: [PATCH 2/4] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
From: William Lee Irwin III @ 2005-11-10 0:11 UTC
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
    kenneth.w.chen

On Wed, Nov 09, 2005 at 05:37:52PM -0600, Adam Litke wrote:
> Hugetlb: Rename find_lock_page to find_or_alloc_huge_page
> On Wed, 2005-10-26 at 12:00 +1000, David Gibson wrote:
> - find_lock_huge_page() isn't a great name, since it does extra things
>   not analogous to find_lock_page().  Rename it
>   find_or_alloc_huge_page() which is closer to the mark.
> Original post by David Gibson <david@gibson.dropbear.id.au>
> Version 2: Wed 9 Nov 2005
> 	Split into a separate patch
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Adam Litke <agl@us.ibm.com>

Also innocuous.

Acked-by: William Irwin <wli@holomorphy.com>

-- wli

* [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW
From: Adam Litke @ 2005-11-09 23:38 UTC
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
    kenneth.w.chen, ADAM G. LITKE [imap]

Hugetlb: Reorganize hugetlb_fault to prepare for COW

This patch splits the "no_page()" type activity into its own function,
hugetlb_no_page().  hugetlb_fault() becomes the entry point for hugetlb faults
and delegates to the appropriate handler depending on the type of fault.  Right
now we still have only hugetlb_no_page() but a later patch introduces a COW
fault.

Original post by David Gibson <david@gibson.dropbear.id.au>

Version 2: Wed 9 Nov 2005
	Broken out into a separate patch

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 hugetlb.c |   34 +++++++++++++++++++++++++---------
 1 files changed, 25 insertions(+), 9 deletions(-)
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -370,20 +370,15 @@ out:
 	return page;
 }
 
-int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, int write_access)
+int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep)
 {
 	int ret = VM_FAULT_SIGBUS;
 	unsigned long idx;
 	unsigned long size;
-	pte_t *pte;
 	struct page *page;
 	struct address_space *mapping;
 
-	pte = huge_pte_alloc(mm, address);
-	if (!pte)
-		goto out;
-
 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
 		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
@@ -402,11 +397,11 @@ int hugetlb_fault(struct mm_struct *mm, 
 		goto backout;
 
 	ret = VM_FAULT_MINOR;
-	if (!pte_none(*pte))
+	if (!pte_none(*ptep))
 		goto backout;
 
 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
-	set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
+	set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page));
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
 out:
@@ -420,6 +415,27 @@ backout:
 	goto out;
 }
 
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	pte_t entry;
+
+	ptep = huge_pte_alloc(mm, address);
+	if (!ptep)
+		return VM_FAULT_OOM;
+
+	entry = *ptep;
+	if (pte_none(entry))
+		return hugetlb_no_page(mm, vma, address, ptep);
+
+	/*
+	 * We could get here if another thread instantiated the pte
+	 * before the test above.
+	 */
+	return VM_FAULT_MINOR;
+}
+
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
 			unsigned long *position, int *length, int i)

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

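Editorial sketch of the two decisions this patch separates: hugetlb_fault() classifies the fault by looking at the pte, and hugetlb_no_page() turns the faulting address into a huge page index within the file. The following is a userspace illustration under stated assumptions, not kernel code: 4 KiB base pages and 2 MiB huge pages are assumed, an empty pte is modelled as the value 0 (which is not guaranteed on every architecture), and struct toy_vma, huge_page_index(), and classify_fault() are hypothetical names.

```c
#include <assert.h>

#define PAGE_SHIFT  12	/* assumption: 4 KiB base pages */
#define HPAGE_SHIFT 21	/* assumption: 2 MiB huge pages */

/* Hypothetical stand-in for the two vm_area_struct fields used here. */
struct toy_vma {
	unsigned long vm_start;	/* first address of the mapping */
	unsigned long vm_pgoff;	/* file offset, in base pages */
};

/* Same arithmetic as the idx computation in hugetlb_no_page(). */
static unsigned long huge_page_index(const struct toy_vma *vma,
				     unsigned long address)
{
	return ((address - vma->vm_start) >> HPAGE_SHIFT)
		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
}

enum toy_fault { TOY_NO_PAGE, TOY_ALREADY_MAPPED };

/* Models the pte_none() test in the new hugetlb_fault(): an empty pte
 * goes to the no-page handler, anything else is treated as a race with
 * another thread that already instantiated the pte. */
static enum toy_fault classify_fault(unsigned long pte_value)
{
	return pte_value == 0 ? TOY_NO_PAGE : TOY_ALREADY_MAPPED;
}
```

For example, a fault 5 MiB into a mapping whose file offset is 4 MiB (vm_pgoff of 1024 base pages) lands in huge page 2 of the mapping and huge page 4 of the file.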
* Re: [PATCH 3/4] Hugetlb: Reorganize hugetlb_fault to prepare for COW
From: William Lee Irwin III @ 2005-11-10 0:13 UTC
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth,
    kenneth.w.chen

On Wed, Nov 09, 2005 at 05:38:47PM -0600, Adam Litke wrote:
> Hugetlb: Reorganize hugetlb_fault to prepare for COW
> This patch splits the "no_page()" type activity into its own function,
> hugetlb_no_page().  hugetlb_fault() becomes the entry point for hugetlb faults
> and delegates to the appropriate handler depending on the type of fault.  Right
> now we still have only hugetlb_no_page() but a later patch introduces a COW
> fault.
> Original post by David Gibson <david@gibson.dropbear.id.au>
> Version 2: Wed 9 Nov 2005
> 	Broken out into a separate patch
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Adam Litke <agl@us.ibm.com>

Straightforward enough.

Acked-by: William Irwin <wli@holomorphy.com>

-- wli

* [PATCH 4/4] Hugetlb: Copy on Write support
From: Adam Litke @ 2005-11-09 23:39 UTC
To: akpm
Cc: linux-mm, linux-kernel, David Gibson, wli, hugh, rohit.seth,
    kenneth.w.chen, ADAM G. LITKE [imap]

Hugetlb: Copy on Write support

Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
supported.  This helps us to safely use hugetlb pages in many more
applications.  The patch makes the following changes.  If needed, I also have
it broken out according to the following paragraphs.

1. Add a pair of functions to set/clear write access on huge ptes.  The
writable check in make_huge_pte is moved out to the caller for use by COW
later.

2. Hugetlb copy-on-write requires special case handling in the following
situations:
 - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
   fault will be triggered (if necessary) if those pages are written to.
 - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
   cache.  MAP_PRIVATE pages still need to be locked however.

3. Provide hugetlb_cow(), called from hugetlb_fault() and hugetlb_no_page(),
which handles the COW fault by making the actual copy.

4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
allowed.  Make VM_HUGETLB mappings exempt from the deprecated VM_RESERVED
mapping check.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 fs/hugetlbfs/inode.c    |    3 
 include/linux/hugetlb.h |   12 +++++
 mm/hugetlb.c            |  115 ++++++++++++++++++++++++++++++++++++++++--------
 mm/mmap.c               |    4 -
 4 files changed, 110 insertions(+), 24 deletions(-)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -100,9 +100,6 @@ static int hugetlbfs_file_mmap(struct fi
 	loff_t len, vma_len;
 	int ret;
 
-	if ((vma->vm_flags & (VM_MAYSHARE | VM_WRITE)) == VM_WRITE)
-		return -EINVAL;
-
 	if (vma->vm_pgoff & (HPAGE_SIZE / PAGE_SIZE - 1))
 		return -EINVAL;
 
diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h
+++ current/include/linux/hugetlb.h
@@ -65,6 +65,18 @@ pte_t huge_ptep_get_and_clear(struct mm_
 			      pte_t *ptep);
 #endif
 
+#define huge_ptep_set_wrprotect(mm, addr, ptep) \
+	ptep_set_wrprotect(mm, addr, ptep)
+static inline void set_huge_ptep_writable(struct vm_area_struct *vma,
+				unsigned long address, pte_t *ptep)
+{
+	pte_t entry;
+
+	entry = pte_mkwrite(pte_mkdirty(*ptep));
+	ptep_set_access_flags(vma, address, ptep, entry, 1);
+	update_mmu_cache(vma, address, entry);
+}
+
 #ifndef ARCH_HAS_HUGETLB_PREFAULT_HOOK
 #define hugetlb_prefault_arch_hook(mm)		do { } while (0)
 #else
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -255,11 +255,12 @@ struct vm_operations_struct hugetlb_vm_o
 	.nopage = hugetlb_nopage,
 };
 
-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page)
+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+				int writable)
 {
 	pte_t entry;
 
-	if (vma->vm_flags & VM_WRITE) {
+	if (writable) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 	} else {
@@ -277,6 +278,9 @@ int copy_hugetlb_page_range(struct mm_st
 	pte_t *src_pte, *dst_pte, entry;
 	struct page *ptepage;
 	unsigned long addr;
+	int cow;
+
+	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		src_pte = huge_pte_offset(src, addr);
@@ -288,6 +292,8 @@ int copy_hugetlb_page_range(struct mm_st
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
 		if (!pte_none(*src_pte)) {
+			if (cow)
+				huge_ptep_set_wrprotect(src, addr, src_pte);
 			entry = *src_pte;
 			ptepage = pte_page(entry);
 			get_page(ptepage);
@@ -340,7 +346,7 @@ void unmap_hugepage_range(struct vm_area
 }
 
 static struct page *find_or_alloc_huge_page(struct address_space *mapping,
-						unsigned long idx)
+				unsigned long idx, int shared)
 {
 	struct page *page;
 	int err;
@@ -358,26 +364,80 @@ retry:
 		goto out;
 	}
 
-	err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
-	if (err) {
-		put_page(page);
-		hugetlb_put_quota(mapping);
-		if (err == -EEXIST)
-			goto retry;
-		page = NULL;
+	if (shared) {
+		err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+		if (err) {
+			put_page(page);
+			hugetlb_put_quota(mapping);
+			if (err == -EEXIST)
+				goto retry;
+			page = NULL;
+		}
+	} else {
+		/* Caller expects a locked page */
+		lock_page(page);
 	}
 out:
 	return page;
 }
 
+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pte_t pte)
+{
+	struct page *old_page, *new_page;
+	int i, avoidcopy;
+
+	old_page = pte_page(pte);
+
+	/* If no-one else is actually using this page, avoid the copy
+	 * and just make the page writable */
+	avoidcopy = (page_count(old_page) == 1);
+	if (avoidcopy) {
+		set_huge_ptep_writable(vma, address, ptep);
+		return VM_FAULT_MINOR;
+	}
+
+	page_cache_get(old_page);
+	new_page = alloc_huge_page();
+
+	if (! new_page) {
+		page_cache_release(old_page);
+
+		/* Logically this is OOM, not a SIGBUS, but an OOM
+		 * could cause the kernel to go killing other
+		 * processes which won't help the hugepage situation
+		 * at all (?) */
+		return VM_FAULT_SIGBUS;
+	}
+
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
+		copy_user_highpage(new_page + i, old_page + i,
+				   address + i*PAGE_SIZE);
+	spin_lock(&mm->page_table_lock);
+
+	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+	if (likely(pte_same(*ptep, pte))) {
+		/* Break COW */
+		set_huge_pte_at(mm, address, ptep,
+				make_huge_pte(vma, new_page, 1));
+		/* Make the old page be freed below */
+		new_page = old_page;
+	}
+	page_cache_release(new_page);
+	page_cache_release(old_page);
+	return VM_FAULT_MINOR;
+}
+
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep)
+			unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
 	unsigned long idx;
 	unsigned long size;
 	struct page *page;
 	struct address_space *mapping;
+	pte_t new_pte;
 
 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
@@ -387,10 +447,13 @@ int hugetlb_no_page(struct mm_struct *mm
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
 	 */
-	page = find_or_alloc_huge_page(mapping, idx);
+	page = find_or_alloc_huge_page(mapping, idx,
+			vma->vm_flags & VM_SHARED);
 	if (!page)
 		goto out;
 
+	BUG_ON(!PageLocked(page));
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
 	if (idx >= size)
@@ -401,7 +464,15 @@ int hugetlb_no_page(struct mm_struct *mm
 		goto backout;
 
 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
-	set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page));
+	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
+				&& (vma->vm_flags & VM_SHARED)));
+	set_huge_pte_at(mm, address, ptep, new_pte);
+
+	if (write_access && !(vma->vm_flags & VM_SHARED)) {
+		/* Optimization, do the COW without a second fault */
+		ret = hugetlb_cow(mm, vma, address, ptep, new_pte);
+	}
+
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
 out:
@@ -420,6 +491,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 {
 	pte_t *ptep;
 	pte_t entry;
+	int ret;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
@@ -427,13 +499,18 @@ int hugetlb_fault(struct mm_struct *mm, 
 
 	entry = *ptep;
 	if (pte_none(entry))
-		return hugetlb_no_page(mm, vma, address, ptep);
+		return hugetlb_no_page(mm, vma, address, ptep, write_access);
 
-	/*
-	 * We could get here if another thread instantiated the pte
-	 * before the test above.
-	 */
-	return VM_FAULT_MINOR;
+	ret = VM_FAULT_MINOR;
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
 }
 
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
diff -upN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c
+++ current/mm/mmap.c
@@ -1077,8 +1077,8 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
-		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
-						== (VM_WRITE | VM_RESERVED)) {
+		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED
+			| VM_HUGETLB)) == (VM_WRITE | VM_RESERVED)) {
 			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
 				"PROT_WRITE mmap of VM_RESERVED memory, which "
 				"is deprecated. Please report this to "

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

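Editorial sketch of the copy-on-write flow described in this patch: at fork time a private, potentially writable mapping is write protected and the page is shared; on a later write fault the sole user of a page simply regains write access, while a shared page is copied first. The following is a toy userspace model under stated assumptions, not the kernel implementation. It ignores locking, the pte_same() recheck, and the page cache, and every toy_* name, the 64-byte page size, and the flag values are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64	/* toy size, far smaller than a huge page */

/* Illustrative flag values for this sketch only. */
#define VM_SHARED   0x08u
#define VM_MAYWRITE 0x20u

struct toy_page {
	int count;			/* number of ptes referencing the page */
	unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pte {
	struct toy_page *page;
	int writable;
};

/* Same predicate as the "cow" test added to copy_hugetlb_page_range():
 * private (not VM_SHARED) and potentially writable. */
static int is_cow_mapping(unsigned int vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Models fork: the parent pte is write protected if the mapping is COW,
 * then the child receives a copy of the pte and a page reference. */
static void toy_fork_share(struct toy_pte *parent, struct toy_pte *child,
			   unsigned int vm_flags)
{
	if (is_cow_mapping(vm_flags))
		parent->writable = 0;
	*child = *parent;
	child->page->count++;
}

/* Models a write fault on a read-only pte.  Returns 0 on success,
 * -1 if no memory is available for the copy. */
static int toy_cow_fault(struct toy_pte *pte)
{
	struct toy_page *old = pte->page, *copy;

	if (old->count == 1) {		/* sole user: avoid the copy */
		pte->writable = 1;
		return 0;
	}
	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;
	memcpy(copy->data, old->data, TOY_PAGE_SIZE);
	copy->count = 1;
	old->count--;			/* this pte drops its old reference */
	pte->page = copy;
	pte->writable = 1;
	return 0;
}

/* Scenario: fork, child writes (copy), parent writes (no copy needed).
 * Returns 1 if the outcome matches copy-on-write semantics, else 0. */
static int toy_cow_scenario(void)
{
	struct toy_page *orig = calloc(1, sizeof(*orig));
	struct toy_pte parent, child;
	int ok;

	if (!orig)
		return 0;
	orig->count = 1;
	orig->data[0] = 'A';
	parent.page = orig;
	parent.writable = 1;

	toy_fork_share(&parent, &child, VM_MAYWRITE);
	if (toy_cow_fault(&child) != 0) {
		free(orig);
		return 0;
	}
	child.page->data[0] = 'B';	/* lands in the private copy */
	toy_cow_fault(&parent);		/* count is 1 again: reuse in place */

	ok = child.page != orig && parent.page == orig &&
	     orig->data[0] == 'A' && child.page->data[0] == 'B' &&
	     orig->count == 1 && parent.writable && child.writable;
	free(child.page);
	free(orig);
	return ok;
}
```

The second fault in the scenario takes the same shortcut as the avoidcopy branch of hugetlb_cow(): once the reference count is back to one, write access is restored without allocating another page.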
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
  2005-11-09 23:39 ` Adam Litke
@ 2005-11-10 0:15 ` William Lee Irwin III
From: William Lee Irwin III @ 2005-11-10 0:15 UTC
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, hugh, rohit.seth, kenneth.w.chen

On Wed, Nov 09, 2005 at 05:39:55PM -0600, Adam Litke wrote:
> Hugetlb: Copy on Write support
> Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
> supported. This helps us to safely use hugetlb pages in many more
> applications. The patch makes the following changes. If needed, I also have
> it broken out according to the following paragraphs.
> 1. Add a pair of functions to set/clear write access on huge ptes. The
> writable check in make_huge_pte is moved out to the caller for use by COW
> later.
> 2. Hugetlb copy-on-write requires special case handling in the following
> situations:
> - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
> fault will be triggered (if necessary) if those pages are written to.
> - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
> cache. MAP_PRIVATE pages still need to be locked however.
> 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page()
> which handles the COW fault by making the actual copy.
> 4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
> allowed. Make VM_HUGETLB exempt from the deprecated VM_RESERVED mapping
> check.

Did you do the audit of pte protection bits I asked about? If not, I'll
dredge them up and check to make sure.

-- wli
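For readers unfamiliar with the MAP_PRIVATE semantics that the quoted description wants to extend to hugetlb mappings, the following user-space sketch shows the behaviour being asked for. It is only an illustration under stated assumptions: it uses an ordinary anonymous private mapping rather than a hugetlbfs file, and the function name private_write_stays_private() is made up for this note.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>		/* used by the checks in the usage note */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a write made by a forked child to a
 * MAP_PRIVATE mapping stays invisible to the parent, 0 if the write
 * leaked into the parent, and -1 if the setup itself failed.
 */
static int private_write_stays_private(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *buf;
	pid_t pid;
	int status;
	int result;

	/* Ordinary pages stand in for huge pages here (assumption). */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	buf[0] = 'P';

	pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: this write should land in a private copy. */
		buf[0] = 'C';
		_exit(buf[0] == 'C' ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		result = -1;
	else
		result = (buf[0] == 'P');	/* parent still sees its own data */

	munmap(buf, len);
	return result;
}
```

On a system with working copy-on-write the helper is expected to return 1. The patch under discussion aims to give the same guarantee to private, writable hugetlb mappings.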
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
  2005-11-10 0:15 ` William Lee Irwin III
@ 2005-11-10 0:49 ` David Gibson
From: David Gibson @ 2005-11-10 0:49 UTC
To: William Lee Irwin III
Cc: Adam Litke, akpm, linux-mm, linux-kernel, hugh, rohit.seth, kenneth.w.chen

On Wed, Nov 09, 2005 at 04:15:34PM -0800, William Lee Irwin wrote:
> On Wed, Nov 09, 2005 at 05:39:55PM -0600, Adam Litke wrote:
> > Hugetlb: Copy on Write support
> > Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
> > supported. This helps us to safely use hugetlb pages in many more
> > applications. The patch makes the following changes. If needed, I also have
> > it broken out according to the following paragraphs.
> > 1. Add a pair of functions to set/clear write access on huge ptes. The
> > writable check in make_huge_pte is moved out to the caller for use by COW
> > later.
> > 2. Hugetlb copy-on-write requires special case handling in the following
> > situations:
> > - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
> > fault will be triggered (if necessary) if those pages are written to.
> > - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
> > cache. MAP_PRIVATE pages still need to be locked however.
> > 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page()
> > which handles the COW fault by making the actual copy.
> > 4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
> > allowed. Make VM_HUGETLB exempt from the deprecated VM_RESERVED mapping
> > check.
>
> Did you do the audit of pte protection bits I asked about? If not, I'll
> dredge them up and check to make sure.

I still don't know what you're talking about here - you never
responded to my mail asking for clarification. The hugepage code
already relies on pte_mkwrite() and pte_wrprotect() working correctly,
I don't see that COW makes any difference.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
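The point about pte_mkwrite() and pte_wrprotect() can be made concrete with a small user-space model. This is a rough sketch, not kernel code: the toy_ names and the bit positions are invented for illustration and do not match any architecture's real pte layout. It shows that the COW path needs nothing beyond the same write-bit helpers: fork write-protects the source entry, and a later write fault is recognised simply by the write bit being clear.

```c
#include <assert.h>	/* used by the checks in the usage note */
#include <stdint.h>

/* Hypothetical pte: the bit positions below are illustrative only. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_PRESENT	(UINT64_C(1) << 0)
#define TOY_PTE_RW	(UINT64_C(1) << 1)
#define TOY_PTE_DIRTY	(UINT64_C(1) << 6)

static toy_pte_t toy_pte_mkwrite(toy_pte_t pte)
{
	pte.val |= TOY_PTE_RW;
	return pte;
}

static toy_pte_t toy_pte_wrprotect(toy_pte_t pte)
{
	pte.val &= ~TOY_PTE_RW;
	return pte;
}

static toy_pte_t toy_pte_mkdirty(toy_pte_t pte)
{
	pte.val |= TOY_PTE_DIRTY;
	return pte;
}

static int toy_pte_write(toy_pte_t pte)
{
	return (pte.val & TOY_PTE_RW) != 0;
}

static int toy_pte_same(toy_pte_t a, toy_pte_t b)
{
	return a.val == b.val;
}

/* fork() path: write-protect the source entry when the mapping is COW,
 * then hand the same entry to the child. */
static toy_pte_t toy_fork_copy(toy_pte_t *src_pte, int cow)
{
	if (cow)
		*src_pte = toy_pte_wrprotect(*src_pte);
	return *src_pte;
}

/* Fault path: only a write to a non-writable entry needs COW handling. */
static int toy_needs_cow(toy_pte_t pte, int write_access)
{
	return write_access && !toy_pte_write(pte);
}
```

In this model a writable, dirty parent entry passed through toy_fork_copy() with cow set comes back read-only in both parent and child, and toy_needs_cow() then reports true for a write fault and false for a read fault.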
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
  2005-11-10 0:49 ` David Gibson
@ 2005-11-10 0:56 ` William Lee Irwin III
From: William Lee Irwin III @ 2005-11-10 0:56 UTC
To: David Gibson
Cc: Adam Litke, akpm, linux-mm, linux-kernel, hugh, rohit.seth, kenneth.w.chen

On Wed, Nov 09, 2005 at 04:15:34PM -0800, William Lee Irwin wrote:
>> Did you do the audit of pte protection bits I asked about? If not, I'll
>> dredge them up and check to make sure.

On Thu, Nov 10, 2005 at 11:49:07AM +1100, David Gibson wrote:
> I still don't know what you're talking about here - you never
> responded to my mail asking for clarification. The hugepage code
> already relies on pte_mkwrite() and pte_wrprotect() working correctly,
> I don't see that COW makes any difference.

You appear to have a good idea of what's going on given that you've
reminded me of that reliance. It looks like I dropped that email packet
for some reason, sorry about that.

Acked-by: William Irwin <wli@holomorphy.com>

-- wli
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
  2005-11-09 23:39 ` Adam Litke
@ 2005-11-10 1:52 ` Rohit Seth
From: Rohit Seth @ 2005-11-10 1:52 UTC
To: Adam Litke
Cc: akpm, linux-mm, linux-kernel, David Gibson, wli, hugh, kenneth.w.chen

On Wed, 2005-11-09 at 17:39 -0600, Adam Litke wrote:
>
> +#define huge_ptep_set_wrprotect(mm, addr, ptep) \
> +	ptep_set_wrprotect(mm, addr, ptep)
> +static inline void set_huge_ptep_writable(struct vm_area_struct *vma,
> +		unsigned long address, pte_t *ptep)
> +{
> +	pte_t entry;
> +
> +	entry = pte_mkwrite(pte_mkdirty(*ptep));
> +	ptep_set_access_flags(vma, address, ptep, entry, 1);
> +	update_mmu_cache(vma, address, entry);
> +}

lazy_mmu_prot_update will need to be called here to make caches coherent
for some archs.

-rohit
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
  2005-11-10 1:52 ` Rohit Seth
@ 2005-11-10 3:54 ` David Gibson
From: David Gibson @ 2005-11-10 3:54 UTC
To: Rohit Seth
Cc: Adam Litke, akpm, linux-mm, linux-kernel, wli, hugh, kenneth.w.chen

On Wed, Nov 09, 2005 at 05:52:44PM -0800, Rohit Seth wrote:
> On Wed, 2005-11-09 at 17:39 -0600, Adam Litke wrote:
> >
> > +#define huge_ptep_set_wrprotect(mm, addr, ptep) \
> > +	ptep_set_wrprotect(mm, addr, ptep)
> > +static inline void set_huge_ptep_writable(struct vm_area_struct *vma,
> > +		unsigned long address, pte_t *ptep)
> > +{
> > +	pte_t entry;
> > +
> > +	entry = pte_mkwrite(pte_mkdirty(*ptep));
> > +	ptep_set_access_flags(vma, address, ptep, entry, 1);
> > +	update_mmu_cache(vma, address, entry);
> > +}
>
> lazy_mmu_prot_update will need to be called here to make caches coherent
> for some archs.

Ah, yes indeed. Revised version below. While I was at it, I moved
set_huge_ptep_writable() into mm/hugetlb.c, since there's no actual
need for it to be in the .h, and abolished huge_ptep_set_wrprotect()
since there's no need for the macro at all.

Hugetlb: Copy on Write support

Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
supported. This helps us to safely use hugetlb pages in many more
applications. The patch makes the following changes. If needed, I also have
it broken out according to the following paragraphs.

1. Add a pair of functions to set/clear write access on huge ptes. The
   writable check in make_huge_pte is moved out to the caller for use by COW
   later.

2. Hugetlb copy-on-write requires special case handling in the following
   situations:
   - copy_hugetlb_page_range() - Copied pages must be write protected so a COW
     fault will be triggered (if necessary) if those pages are written to.
   - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page
     cache. MAP_PRIVATE pages still need to be locked however.

3. Provide hugetlb_cow(), which handles the COW fault by making the actual
   copy, and call it from hugetlb_fault() and hugetlb_no_page().

4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps will be
   allowed. Make VM_HUGETLB exempt from the deprecated VM_RESERVED mapping
   check.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Adam Litke <agl@us.ibm.com>

Index: working-2.6/fs/hugetlbfs/inode.c
===================================================================
--- working-2.6.orig/fs/hugetlbfs/inode.c	2005-11-10 14:41:51.000000000 +1100
+++ working-2.6/fs/hugetlbfs/inode.c	2005-11-10 14:44:13.000000000 +1100
@@ -100,9 +100,6 @@
 	loff_t len, vma_len;
 	int ret;
 
-	if ((vma->vm_flags & (VM_MAYSHARE | VM_WRITE)) == VM_WRITE)
-		return -EINVAL;
-
 	if (vma->vm_pgoff & (HPAGE_SIZE / PAGE_SIZE - 1))
 		return -EINVAL;
 
Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2005-11-10 14:41:51.000000000 +1100
+++ working-2.6/mm/hugetlb.c	2005-11-10 14:44:13.000000000 +1100
@@ -255,11 +255,12 @@
 	.nopage = hugetlb_nopage,
 };
 
-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page)
+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+			   int writable)
 {
 	pte_t entry;
 
-	if (vma->vm_flags & VM_WRITE) {
+	if (writable) {
 		entry =
 		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
 	} else {
@@ -271,12 +272,27 @@
 	return entry;
 }
 
+static void set_huge_ptep_writable(struct vm_area_struct *vma,
+				   unsigned long address, pte_t *ptep)
+{
+	pte_t entry;
+
+	entry = pte_mkwrite(pte_mkdirty(*ptep));
+	ptep_set_access_flags(vma, address, ptep, entry, 1);
+	update_mmu_cache(vma, address, entry);
+	lazy_mmu_prot_update(entry);
+}
+
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *vma)
 {
 	pte_t *src_pte, *dst_pte, entry;
 	struct page *ptepage;
 	unsigned long addr;
+	int cow;
+
+	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		src_pte = huge_pte_offset(src, addr);
@@ -288,6 +304,8 @@
 		spin_lock(&dst->page_table_lock);
 		spin_lock(&src->page_table_lock);
 		if (!pte_none(*src_pte)) {
+			if (cow)
+				ptep_set_wrprotect(src, addr, src_pte);
 			entry = *src_pte;
 			ptepage = pte_page(entry);
 			get_page(ptepage);
@@ -340,7 +358,7 @@
 }
 
 static struct page *find_or_alloc_huge_page(struct address_space *mapping,
-				unsigned long idx)
+				unsigned long idx, int shared)
 {
 	struct page *page;
 	int err;
@@ -358,26 +376,80 @@
 		goto out;
 	}
 
-	err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
-	if (err) {
-		put_page(page);
-		hugetlb_put_quota(mapping);
-		if (err == -EEXIST)
-			goto retry;
-		page = NULL;
+	if (shared) {
+		err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+		if (err) {
+			put_page(page);
+			hugetlb_put_quota(mapping);
+			if (err == -EEXIST)
+				goto retry;
+			page = NULL;
+		}
+	} else {
+		/* Caller expects a locked page */
+		lock_page(page);
 	}
 out:
 	return page;
 }
 
+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pte_t pte)
+{
+	struct page *old_page, *new_page;
+	int i, avoidcopy;
+
+	old_page = pte_page(pte);
+
+	/* If no-one else is actually using this page, avoid the copy
+	 * and just make the page writable */
+	avoidcopy = (page_count(old_page) == 1);
+	if (avoidcopy) {
+		set_huge_ptep_writable(vma, address, ptep);
+		return VM_FAULT_MINOR;
+	}
+
+	page_cache_get(old_page);
+	new_page = alloc_huge_page();
+
+	if (! new_page) {
+		page_cache_release(old_page);
+
+		/* Logically this is OOM, not a SIGBUS, but an OOM
+		 * could cause the kernel to go killing other
+		 * processes which won't help the hugepage situation
+		 * at all (?) */
+		return VM_FAULT_SIGBUS;
+	}
+
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
+		copy_user_highpage(new_page + i, old_page + i,
+				   address + i*PAGE_SIZE);
+	spin_lock(&mm->page_table_lock);
+
+	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+	if (likely(pte_same(*ptep, pte))) {
+		/* Break COW */
+		set_huge_pte_at(mm, address, ptep,
+				make_huge_pte(vma, new_page, 1));
+		/* Make the old page be freed below */
+		new_page = old_page;
+	}
+	page_cache_release(new_page);
+	page_cache_release(old_page);
+	return VM_FAULT_MINOR;
+}
+
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep)
+			unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
 	unsigned long idx;
 	unsigned long size;
 	struct page *page;
 	struct address_space *mapping;
+	pte_t new_pte;
 
 	mapping = vma->vm_file->f_mapping;
 	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
@@ -387,10 +459,13 @@
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
 	 */
-	page = find_or_alloc_huge_page(mapping, idx);
+	page = find_or_alloc_huge_page(mapping, idx,
+			vma->vm_flags & VM_SHARED);
 	if (!page)
 		goto out;
 
+	BUG_ON(!PageLocked(page));
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
 	if (idx >= size)
@@ -401,7 +476,15 @@
 		goto backout;
 
 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
-	set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page));
+	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
+				&& (vma->vm_flags & VM_SHARED)));
+	set_huge_pte_at(mm, address, ptep, new_pte);
+
+	if (write_access && !(vma->vm_flags & VM_SHARED)) {
+		/* Optimization, do the COW without a second fault */
+		ret = hugetlb_cow(mm, vma, address, ptep, new_pte);
+	}
+
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
 out:
@@ -420,6 +503,7 @@
 {
 	pte_t *ptep;
 	pte_t entry;
+	int ret;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
@@ -427,13 +511,18 @@
 
 	entry = *ptep;
 	if (pte_none(entry))
-		return hugetlb_no_page(mm, vma, address, ptep);
+		return hugetlb_no_page(mm, vma, address, ptep, write_access);
 
-	/*
-	 * We could get here if another thread instantiated the pte
-	 * before the test above.
-	 */
-	return VM_FAULT_MINOR;
+	ret = VM_FAULT_MINOR;
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
 }
 
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
Index: working-2.6/mm/mmap.c
===================================================================
--- working-2.6.orig/mm/mmap.c	2005-11-10 14:41:51.000000000 +1100
+++ working-2.6/mm/mmap.c	2005-11-10 14:44:13.000000000 +1100
@@ -1076,8 +1076,9 @@
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
-		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
-						== (VM_WRITE | VM_RESERVED)) {
+		if ((vma->vm_flags
+		     & (VM_SHARED | VM_WRITE | VM_RESERVED | VM_HUGETLB))
+		    == (VM_WRITE | VM_RESERVED)) {
 			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
 				"PROT_WRITE mmap of VM_RESERVED memory, which "
 				"is deprecated. Please report this to "

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
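The decision made in hugetlb_cow() above can be followed more easily in a stripped-down user-space model. The sketch below is only an approximation under stated assumptions: struct toy_page, struct toy_pte and every toy_ function are hypothetical stand-ins, a 64-byte buffer plays the role of a huge page, and the locking, the pte_same() recheck after the copy, and the cache maintenance calls are all left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64	/* stand-in for HPAGE_SIZE, tiny on purpose */

/* Hypothetical stand-ins for struct page and a huge pte. */
struct toy_page {
	int refcount;
	unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pte {
	struct toy_page *page;
	int writable;
};

enum toy_fault { TOY_FAULT_MINOR, TOY_FAULT_SIGBUS };

static struct toy_page *toy_alloc_page(void)
{
	struct toy_page *page = calloc(1, sizeof(*page));

	if (page)
		page->refcount = 1;
	return page;
}

static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0)
		free(page);
}

/* fork()-like sharing: write-protect the source and take a reference. */
static struct toy_pte toy_share(struct toy_pte *src)
{
	struct toy_pte dst;

	src->writable = 0;
	src->page->refcount++;
	dst.page = src->page;
	dst.writable = 0;
	return dst;
}

/* Resolve a write fault on a read-only toy pte. */
static enum toy_fault toy_cow(struct toy_pte *pte)
{
	struct toy_page *old_page = pte->page;
	struct toy_page *new_page;

	/* Sole user of the page: skip the copy, just allow writes. */
	if (old_page->refcount == 1) {
		pte->writable = 1;
		return TOY_FAULT_MINOR;
	}

	new_page = toy_alloc_page();
	if (!new_page)
		return TOY_FAULT_SIGBUS;	/* mirrors the patch's choice */

	memcpy(new_page->data, old_page->data, TOY_PAGE_SIZE);
	pte->page = new_page;
	pte->writable = 1;
	toy_put_page(old_page);		/* drop this mapping's reference */
	return TOY_FAULT_MINOR;
}
```

In this model the first writer after toy_share() receives a private copy and the page's reference count drops back to one, so the second writer takes the no-copy path, which corresponds to the avoidcopy branch in the patch.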
*/ - page = find_or_alloc_huge_page(mapping, idx); + page = find_or_alloc_huge_page(mapping, idx, + vma->vm_flags & VM_SHARED); if (!page) goto out; + BUG_ON(!PageLocked(page)); + spin_lock(&mm->page_table_lock); size = i_size_read(mapping->host) >> HPAGE_SHIFT; if (idx >= size) @@ -401,7 +476,15 @@ goto backout; add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE); - set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, page)); + new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE) + && (vma->vm_flags & VM_SHARED))); + set_huge_pte_at(mm, address, ptep, new_pte); + + if (write_access && !(vma->vm_flags & VM_SHARED)) { + /* Optimization, do the COW without a second fault */ + ret = hugetlb_cow(mm, vma, address, ptep, new_pte); + } + spin_unlock(&mm->page_table_lock); unlock_page(page); out: @@ -420,6 +503,7 @@ { pte_t *ptep; pte_t entry; + int ret; ptep = huge_pte_alloc(mm, address); if (!ptep) @@ -427,13 +511,18 @@ entry = *ptep; if (pte_none(entry)) - return hugetlb_no_page(mm, vma, address, ptep); + return hugetlb_no_page(mm, vma, address, ptep, write_access); - /* - * We could get here if another thread instantiated the pte - * before the test above. 
- */ - return VM_FAULT_MINOR; + ret = VM_FAULT_MINOR; + + spin_lock(&mm->page_table_lock); + /* Check for a racing update before calling hugetlb_cow */ + if (likely(pte_same(entry, *ptep))) + if (write_access && !pte_write(entry)) + ret = hugetlb_cow(mm, vma, address, ptep, entry); + spin_unlock(&mm->page_table_lock); + + return ret; } int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, Index: working-2.6/mm/mmap.c =================================================================== --- working-2.6.orig/mm/mmap.c 2005-11-10 14:41:51.000000000 +1100 +++ working-2.6/mm/mmap.c 2005-11-10 14:44:13.000000000 +1100 @@ -1076,8 +1076,9 @@ error = file->f_op->mmap(file, vma); if (error) goto unmap_and_free_vma; - if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED)) - == (VM_WRITE | VM_RESERVED)) { + if ((vma->vm_flags + & (VM_SHARED | VM_WRITE | VM_RESERVED | VM_HUGETLB)) + == (VM_WRITE | VM_RESERVED)) { printk(KERN_WARNING "program %s is using MAP_PRIVATE, " "PROT_WRITE mmap of VM_RESERVED memory, which " "is deprecated. Please report this to " -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 28+ messages in thread
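For readers who want to follow the control flow without the kernel context, the fork-time write protection and the fault-time copy described in the patch above can be modelled in plain user-space C. This is only a rough sketch under stated assumptions: every sk_* name, the SK_VM_* flag values and the refcounted page structure are hypothetical stand-ins invented for illustration, not kernel APIs. Locking, the pte_same() recheck after the copy, and the cache and TLB maintenance calls are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag values for this sketch only; the kernel's real
 * VM_SHARED / VM_MAYWRITE constants are defined elsewhere. */
#define SK_VM_SHARED   0x1u
#define SK_VM_MAYWRITE 0x2u

/* Same shape as the test in copy_hugetlb_page_range(): a mapping needs
 * COW when it may be written to but is not shared. */
static bool sk_is_cow_mapping(unsigned int vm_flags)
{
	return (vm_flags & (SK_VM_SHARED | SK_VM_MAYWRITE)) == SK_VM_MAYWRITE;
}

/* Toy stand-ins for struct page and a huge pte. */
struct sk_page {
	int refcount;
	size_t size;
	unsigned char *data;
};

struct sk_pte {
	struct sk_page *page;
	bool writable;
};

static struct sk_page *sk_page_alloc(size_t size)
{
	struct sk_page *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->data = calloc(1, size);
	if (!p->data) {
		free(p);
		return NULL;
	}
	p->refcount = 1;
	p->size = size;
	return p;
}

/* Drop one reference; free the page when the last user goes away. */
static void sk_page_put(struct sk_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0) {
		free(p->data);
		free(p);
	}
}

/* Fork-time copy: for a COW mapping, write-protect the parent entry
 * first so that the child inherits a read-only entry as well. */
static void sk_fork_pte(struct sk_pte *parent, struct sk_pte *child, bool cow)
{
	if (cow)
		parent->writable = false;
	*child = *parent;
	child->page->refcount++;
}

/* Write fault on a read-only entry.  Returns 0 on success, -1 if no
 * new page could be allocated (the patch returns VM_FAULT_SIGBUS). */
static int sk_cow_fault(struct sk_pte *pte)
{
	struct sk_page *old_page = pte->page;
	struct sk_page *new_page;

	/* Sole user: skip the copy and just make the entry writable. */
	if (old_page->refcount == 1) {
		pte->writable = true;
		return 0;
	}

	new_page = sk_page_alloc(old_page->size);
	if (!new_page)
		return -1;

	memcpy(new_page->data, old_page->data, old_page->size);
	pte->page = new_page;
	pte->writable = true;
	sk_page_put(old_page);	/* this mapping no longer uses the old page */
	return 0;
}
```

In this model, after sk_fork_pte() both entries are read-only and share one page. The first writer gets a private copy through sk_cow_fault(), and the remaining sole owner later takes the refcount == 1 shortcut, which corresponds to the avoidcopy path in hugetlb_cow().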
* Re: [PATCH 4/4] Hugetlb: Copy on Write support
From: William Lee Irwin III @ 2005-11-10  4:20 UTC
To: David Gibson
Cc: Rohit Seth, Adam Litke, akpm, linux-mm, linux-kernel, hugh, kenneth.w.chen

On Wed, Nov 09, 2005 at 05:52:44PM -0800, Rohit Seth wrote:
>> lazy_mmu_prot_update will need to called here to make caches coherent
>> for some archs.

On Thu, Nov 10, 2005 at 02:54:03PM +1100, David Gibson wrote:
> Ah, yes indeed.  Revised version below.  While I was at it, I moved
> set_huge_ptep_writable() into mm/hugetlb.c, since there's no actual
> need for it to be in the .h, and abolished huge_ptep_set_wrprotect()
> since there's no need for the macro at all.
> Hugetlb: Copy on Write support

Re-acking. Good catch, thanks Rohit.

Acked-by: William Irwin <wli@holomorphy.com>

-- wli