* [PATCH 0/3] Guarantee faults for processes that call mmap(MAP_PRIVATE) on hugetlbfs v4
@ 2008-05-27 18:50 Mel Gorman
From: Mel Gorman @ 2008-05-27 18:50 UTC (permalink / raw)
To: akpm
Cc: Mel Gorman, dean, apw, linux-kernel, wli, dwg, linux-mm, andi,
kenchen, agl, abh
Hi Andrew,
This is a patchset to give reliable behaviour to a process that successfully
calls mmap(MAP_PRIVATE) on a hugetlbfs file. Currently, it is possible for
the process to be killed due to a small hugepage pool size even if it calls
mlock(). More details are below. No objections have been raised in a while
and I believe the series is ready for wider testing. People are cc'd just in
case minds have changed since. Thanks.
Changelog since V3
o Differentiate between a shared pagecache page and a shared parent/child page.
Without the check, a BUG is triggered when an existing hugetlbfs file is
mapped MAP_PRIVATE and the pool is too small.
Changelog since V2
o Rebase to 2.6.26-rc2-mm1
o Document when hugetlb_lock is held for reserve counter updates
o Add vma_has_private_reserves() helper for clarity
MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time.
This guarantees all future faults against the mapping will succeed.
This allows pages to be allocated locally at first use, improving NUMA
locality whilst retaining reliability.
MAP_PRIVATE mappings do not reserve pages. This can result in an application
being SIGKILLed later if a huge page is not available at fault time. This
makes huge page usage ill-advised in some cases, as the unexpected application
failure cannot be detected and handled; it is immediately fatal.
Although an application may force instantiation of the pages using mlock(),
this may lead to poor memory placement and the process may still be killed
when performing COW.
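
To make that failure window concrete, the sketch below walks through the same
sequence from userspace. It is illustrative only and is not part of this
series: the hugetlbfs mount point /mnt/huge, the 2MB huge page size and the
four-page length are assumptions, and whether the final loop survives depends
entirely on how many huge pages are free when the parent performs COW.

/*
 * Illustrative only: demonstrates the mlock()-then-fork() window described
 * above. The mount point /mnt/huge and the 2MB huge page size are
 * assumptions; adjust both for the system under test.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)
#define LENGTH     (4 * HPAGE_SIZE)

int main(void)
{
	size_t off;
	int fd = open("/mnt/huge/demo", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Step 1: private mapping of a hugetlbfs file */
	char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Step 2: force instantiation of every huge page. Without the
	 * reservation patches, success here proves nothing about later
	 * writes. */
	if (mlock(p, LENGTH) != 0)
		perror("mlock");

	/* Step 3: the child keeps a reference to every instantiated page */
	if (fork() == 0) {
		sleep(10);
		_exit(0);
	}

	/* Step 4: each write now needs a fresh huge page for the COW copy.
	 * If the pool cannot supply one, the writer is killed (step 5). */
	for (off = 0; off < LENGTH; off += HPAGE_SIZE)
		p[off] = 1;

	printf("parent completed its writes\n");
	wait(NULL);
	munlock(p, LENGTH);
	munmap(p, LENGTH);
	close(fd);
	return 0;
}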
This patchset introduces a reliability guarantee for the process which creates
a private mapping, i.e. the process that calls mmap() on a hugetlbfs file
successfully. The first patch of the set is a purely mechanical code move to
make later diffs easier to read. The second patch will guarantee faults up
until the process calls fork(). After patch two, as long as the child keeps
the mappings, the parent is no longer guaranteed to be reliable. Patch
3 guarantees that the parent will always COW successfully by unmapping
the pages from the child in the event there are insufficient pages in the
hugepage pool to allocate a new page, be it via a static or dynamic pool.
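
A rough way to observe the effect of patch 2 is to watch the hugetlb counters
in /proc/meminfo around the mmap() call; with this series applied, a private
mapping should raise HugePages_Rsvd just as a shared one already does. The
helper below only parses /proc/meminfo; the field names are the standard
hugetlb counters, but the expected before/after values are an assumption to
be verified on a patched kernel rather than a guarantee.

/*
 * Minimal helper to read one hugetlb counter from /proc/meminfo, e.g.
 * HugePages_Rsvd or HugePages_Free. Intended for eyeballing the effect of
 * the reservation patches around an mmap() call; not part of the series.
 */
#include <stdio.h>
#include <string.h>

static long hugetlb_counter(const char *name)
{
	char line[128];
	long value = -1;
	FILE *fp = fopen("/proc/meminfo", "r");

	if (!fp)
		return -1;

	while (fgets(line, sizeof(line), fp)) {
		/* Lines look like "HugePages_Rsvd:     0" */
		if (strncmp(line, name, strlen(name)) == 0) {
			sscanf(line + strlen(name), ": %ld", &value);
			break;
		}
	}
	fclose(fp);
	return value;
}

int main(void)
{
	printf("HugePages_Free: %ld\n", hugetlb_counter("HugePages_Free"));
	printf("HugePages_Rsvd: %ld\n", hugetlb_counter("HugePages_Rsvd"));
	return 0;
}

Sampling these two counters immediately before and after the mmap(MAP_PRIVATE)
call in the earlier example, and again after the mapping is torn down, should
be enough to see when reserves are taken and released.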
Existing hugepage-aware applications are unlikely to be affected by this
change. For much of hugetlbfs's history, pages were pre-faulted at mmap()
time or the mmap() failed, which acts in a reserve-like manner. If the pool
is already sized correctly so that parent and child can fault reliably,
the application will not even notice the reserves. It's only when the pool
is too small for the application to function perfectly reliably that the
reserves come into play.
Credit goes to Andy Whitcroft for cleaning up a number of mistakes during
review before the patches were released.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 16+ messages in thread* [PATCH 1/3] Move hugetlb_acct_memory() 2008-05-27 18:50 [PATCH 0/3] Guarantee faults for processes that call mmap(MAP_PRIVATE) on hugetlbfs v4 Mel Gorman @ 2008-05-27 18:50 ` Mel Gorman 2008-05-28 13:37 ` Adam Litke 2008-05-27 18:51 ` [PATCH 2/3] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() Mel Gorman 2008-05-27 18:51 ` [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed Mel Gorman 2 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2008-05-27 18:50 UTC (permalink / raw) To: akpm Cc: Mel Gorman, dean, linux-kernel, wli, dwg, agl, linux-mm, andi, kenchen, apw, abh A later patch in this set needs to call hugetlb_acct_memory() before it is defined. This patch moves the function without modification. This makes later diffs easier to read. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/hugetlb.c | 82 +++++++++++++++++++++++++++--------------------------- 1 file changed, 41 insertions(+), 41 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-clean/mm/hugetlb.c linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/mm/hugetlb.c --- linux-2.6.26-rc2-mm1-clean/mm/hugetlb.c 2008-05-27 10:25:54.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/mm/hugetlb.c 2008-05-27 10:26:13.000000000 +0100 @@ -716,6 +716,47 @@ unsigned long hugetlb_total_pages(void) return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE); } +static int hugetlb_acct_memory(long delta) +{ + int ret = -ENOMEM; + + spin_lock(&hugetlb_lock); + /* + * When cpuset is configured, it breaks the strict hugetlb page + * reservation as the accounting is done on a global variable. Such + * reservation is completely rubbish in the presence of cpuset because + * the reservation is not checked against page availability for the + * current cpuset. Application can still potentially OOM'ed by kernel + * with lack of free htlb page in cpuset that the task is in. + * Attempt to enforce strict accounting with cpuset is almost + * impossible (or too ugly) because cpuset is too fluid that + * task or memory node can be dynamically moved between cpusets. + * + * The change of semantics for shared hugetlb mapping with cpuset is + * undesirable. However, in order to preserve some of the semantics, + * we fall back to check against current free page availability as + * a best attempt and hopefully to minimize the impact of changing + * semantics that cpuset has. + */ + if (delta > 0) { + if (gather_surplus_pages(delta) < 0) + goto out; + + if (delta > cpuset_mems_nr(free_huge_pages_node)) { + return_unused_surplus_pages(delta); + goto out; + } + } + + ret = 0; + if (delta < 0) + return_unused_surplus_pages((unsigned long) -delta); + +out: + spin_unlock(&hugetlb_lock); + return ret; +} + /* * We cannot handle pagefaults against hugetlb pages at all. They cause * handle_mm_fault() to try to instantiate regular-sized pages in the @@ -1248,47 +1289,6 @@ static long region_truncate(struct list_ return chg; } -static int hugetlb_acct_memory(long delta) -{ - int ret = -ENOMEM; - - spin_lock(&hugetlb_lock); - /* - * When cpuset is configured, it breaks the strict hugetlb page - * reservation as the accounting is done on a global variable. Such - * reservation is completely rubbish in the presence of cpuset because - * the reservation is not checked against page availability for the - * current cpuset. 
Application can still potentially OOM'ed by kernel - * with lack of free htlb page in cpuset that the task is in. - * Attempt to enforce strict accounting with cpuset is almost - * impossible (or too ugly) because cpuset is too fluid that - * task or memory node can be dynamically moved between cpusets. - * - * The change of semantics for shared hugetlb mapping with cpuset is - * undesirable. However, in order to preserve some of the semantics, - * we fall back to check against current free page availability as - * a best attempt and hopefully to minimize the impact of changing - * semantics that cpuset has. - */ - if (delta > 0) { - if (gather_surplus_pages(delta) < 0) - goto out; - - if (delta > cpuset_mems_nr(free_huge_pages_node)) { - return_unused_surplus_pages(delta); - goto out; - } - } - - ret = 0; - if (delta < 0) - return_unused_surplus_pages((unsigned long) -delta); - -out: - spin_unlock(&hugetlb_lock); - return ret; -} - int hugetlb_reserve_pages(struct inode *inode, long from, long to) { long ret, chg; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] Move hugetlb_acct_memory()
From: Adam Litke @ 2008-05-28 13:37 UTC (permalink / raw)
To: Mel Gorman
Cc: akpm, dean, linux-kernel, wli, dwg, linux-mm, andi, kenchen, apw, abh
On Tue, 2008-05-27 at 19:50 +0100, Mel Gorman wrote:
> A later patch in this set needs to call hugetlb_acct_memory() before it
> is defined. This patch moves the function without modification. This makes
> later diffs easier to read.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* [PATCH 2/3] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() 2008-05-27 18:50 [PATCH 0/3] Guarantee faults for processes that call mmap(MAP_PRIVATE) on hugetlbfs v4 Mel Gorman 2008-05-27 18:50 ` [PATCH 1/3] Move hugetlb_acct_memory() Mel Gorman @ 2008-05-27 18:51 ` Mel Gorman 2008-05-28 13:52 ` Adam Litke 2008-05-27 18:51 ` [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed Mel Gorman 2 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2008-05-27 18:51 UTC (permalink / raw) To: akpm Cc: Mel Gorman, abh, dean, linux-kernel, wli, dwg, linux-mm, andi, kenchen, agl, apw This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in a similar manner to the reservations taken for MAP_SHARED mappings. The reserve count is accounted both globally and on a per-VMA basis for private mappings. This guarantees that a process that successfully calls mmap() will successfully fault all pages in the future unless fork() is called. The characteristics of private mappings of hugetlbfs files behaviour after this patch are; 1. The process calling mmap() is guaranteed to succeed all future faults until it forks(). 2. On fork(), the parent may die due to SIGKILL on writes to the private mapping if enough pages are not available for the COW. For reasonably reliable behaviour in the face of a small huge page pool, children of hugepage-aware processes should not reference the mappings; such as might occur when fork()ing to exec(). 3. On fork(), the child VMAs inherit no reserves. Reads on pages already faulted by the parent will succeed. Successful writes will depend on enough huge pages being free in the pool. 4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper and at fault time otherwise. Before this patch, all reads or writes in the child potentially needs page allocations that can later lead to the death of the parent. This applies to reads and writes of uninstantiated pages as well as COW. After the patch it is only a write to an instantiated page that causes problems. 
Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/hugetlbfs/inode.c | 8 +- include/linux/hugetlb.h | 9 ++ kernel/fork.c | 9 ++ mm/hugetlb.c | 158 ++++++++++++++++++++++++++++++++----------- 4 files changed, 140 insertions(+), 44 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/fs/hugetlbfs/inode.c linux-2.6.26-rc2-mm1-0020-map_private_reserve/fs/hugetlbfs/inode.c --- linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/fs/hugetlbfs/inode.c 2008-05-12 01:09:41.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0020-map_private_reserve/fs/hugetlbfs/inode.c 2008-05-27 10:26:22.000000000 +0100 @@ -103,9 +103,9 @@ static int hugetlbfs_file_mmap(struct fi ret = -ENOMEM; len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); - if (vma->vm_flags & VM_MAYSHARE && - hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT), - len >> HPAGE_SHIFT)) + if (hugetlb_reserve_pages(inode, + vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT), + len >> HPAGE_SHIFT, vma)) goto out; ret = 0; @@ -942,7 +942,7 @@ struct file *hugetlb_file_setup(const ch goto out_dentry; error = -ENOMEM; - if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT)) + if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT, NULL)) goto out_inode; d_instantiate(dentry, inode); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/include/linux/hugetlb.h linux-2.6.26-rc2-mm1-0020-map_private_reserve/include/linux/hugetlb.h --- linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/include/linux/hugetlb.h 2008-05-12 01:09:41.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0020-map_private_reserve/include/linux/hugetlb.h 2008-05-27 10:26:22.000000000 +0100 @@ -17,6 +17,7 @@ static inline int is_vm_hugetlb_page(str return vma->vm_flags & VM_HUGETLB; } +void reset_vma_resv_huge_pages(struct vm_area_struct *vma); int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); @@ -30,7 +31,8 @@ int hugetlb_report_node_meminfo(int, cha unsigned long hugetlb_total_pages(void); int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access); -int hugetlb_reserve_pages(struct inode *inode, long from, long to); +int hugetlb_reserve_pages(struct inode *inode, long from, long to, + struct vm_area_struct *vma); void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed); extern unsigned long max_huge_pages; @@ -58,6 +60,11 @@ static inline int is_vm_hugetlb_page(str { return 0; } + +static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma) +{ +} + static inline unsigned long hugetlb_total_pages(void) { return 0; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/kernel/fork.c linux-2.6.26-rc2-mm1-0020-map_private_reserve/kernel/fork.c --- linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/kernel/fork.c 2008-05-27 10:25:54.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0020-map_private_reserve/kernel/fork.c 2008-05-27 10:26:22.000000000 +0100 @@ -54,6 +54,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/hugetlb.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -306,6 +307,14 @@ static int dup_mmap(struct mm_struct *mm } /* + * Clear 
hugetlb-related page reserves for children. This only + * affects MAP_PRIVATE mappings. Faults generated by the child + * are not guaranteed to succeed, even if read-only + */ + if (is_vm_hugetlb_page(tmp)) + reset_vma_resv_huge_pages(tmp); + + /* * Link in the new vma and copy the page table entries. */ *pprev = tmp; diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/mm/hugetlb.c linux-2.6.26-rc2-mm1-0020-map_private_reserve/mm/hugetlb.c --- linux-2.6.26-rc2-mm1-0010-move-hugetlb_acct_memory/mm/hugetlb.c 2008-05-27 10:26:13.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0020-map_private_reserve/mm/hugetlb.c 2008-05-27 10:26:22.000000000 +0100 @@ -40,6 +40,69 @@ static int hugetlb_next_nid; */ static DEFINE_SPINLOCK(hugetlb_lock); +/* + * These helpers are used to track how many pages are reserved for + * faults in a MAP_PRIVATE mapping. Only the process that called mmap() + * is guaranteed to have their future faults succeed. + * + * With the exception of reset_vma_resv_huge_pages() which is called at fork(), + * the reserve counters are updated with the hugetlb_lock held. It is safe + * to reset the VMA at fork() time as it is not in use yet and there is no + * chance of the global counters getting corrupted as a result of the values. + */ +static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma) +{ + VM_BUG_ON(!is_vm_hugetlb_page(vma)); + if (!(vma->vm_flags & VM_SHARED)) + return (unsigned long)vma->vm_private_data; + return 0; +} + +static void set_vma_resv_huge_pages(struct vm_area_struct *vma, + unsigned long reserve) +{ + VM_BUG_ON(!is_vm_hugetlb_page(vma)); + VM_BUG_ON(vma->vm_flags & VM_SHARED); + + vma->vm_private_data = (void *)reserve; +} + +/* Decrement the reserved pages in the hugepage pool by one */ +static void decrement_hugepage_resv_vma(struct vm_area_struct *vma) +{ + if (vma->vm_flags & VM_SHARED) { + /* Shared mappings always use reserves */ + resv_huge_pages--; + } else { + /* + * Only the process that called mmap() has reserves for + * private mappings. + */ + if (vma_resv_huge_pages(vma)) { + resv_huge_pages--; + reserve = (unsigned long)vma->vm_private_data - 1; + vma->vm_private_data = (void *)reserve; + } + } +} + +void reset_vma_resv_huge_pages(struct vm_area_struct *vma) +{ + VM_BUG_ON(!is_vm_hugetlb_page(vma)); + if (!(vma->vm_flags & VM_SHARED)) + vma->vm_private_data = (void *)0; +} + +/* Returns true if the VMA has associated reserve pages */ +static int vma_has_private_reserves(struct vm_area_struct *vma) +{ + if (vma->vm_flags & VM_SHARED) + return 0; + if (!vma_resv_huge_pages(vma)) + return 0; + return 1; +} + static void clear_huge_page(struct page *page, unsigned long addr) { int i; @@ -101,6 +164,15 @@ static struct page *dequeue_huge_page_vm struct zone *zone; struct zoneref *z; + /* + * A child process with MAP_PRIVATE mappings created by their parent + * have no page reserves. This check ensures that reservations are + * not "stolen". 
The child may still get SIGKILLed + */ + if (!vma_has_private_reserves(vma) && + free_huge_pages - resv_huge_pages == 0) + return NULL; + for_each_zone_zonelist_nodemask(zone, z, zonelist, MAX_NR_ZONES - 1, nodemask) { nid = zone_to_nid(zone); @@ -111,8 +183,8 @@ static struct page *dequeue_huge_page_vm list_del(&page->lru); free_huge_pages--; free_huge_pages_node[nid]--; - if (vma && vma->vm_flags & VM_MAYSHARE) - resv_huge_pages--; + decrement_hugepage_resv_vma(vma); + break; } } @@ -461,55 +533,40 @@ static void return_unused_surplus_pages( } } - -static struct page *alloc_huge_page_shared(struct vm_area_struct *vma, - unsigned long addr) +static struct page *alloc_huge_page(struct vm_area_struct *vma, + unsigned long addr) { struct page *page; + struct address_space *mapping = vma->vm_file->f_mapping; + struct inode *inode = mapping->host; + unsigned int chg = 0; + + /* + * Processes that did not create the mapping will have no reserves and + * will not have accounted against quota. Check that the quota can be + * made before satisfying the allocation + */ + if (!vma_has_private_reserves(vma)) { + chg = 1; + if (hugetlb_get_quota(inode->i_mapping, chg)) + return ERR_PTR(-ENOSPC); + } spin_lock(&hugetlb_lock); page = dequeue_huge_page_vma(vma, addr); spin_unlock(&hugetlb_lock); - return page ? page : ERR_PTR(-VM_FAULT_OOM); -} -static struct page *alloc_huge_page_private(struct vm_area_struct *vma, - unsigned long addr) -{ - struct page *page = NULL; - - if (hugetlb_get_quota(vma->vm_file->f_mapping, 1)) - return ERR_PTR(-VM_FAULT_SIGBUS); - - spin_lock(&hugetlb_lock); - if (free_huge_pages > resv_huge_pages) - page = dequeue_huge_page_vma(vma, addr); - spin_unlock(&hugetlb_lock); if (!page) { page = alloc_buddy_huge_page(vma, addr); if (!page) { - hugetlb_put_quota(vma->vm_file->f_mapping, 1); + hugetlb_put_quota(inode->i_mapping, chg); return ERR_PTR(-VM_FAULT_OOM); } } - return page; -} -static struct page *alloc_huge_page(struct vm_area_struct *vma, - unsigned long addr) -{ - struct page *page; - struct address_space *mapping = vma->vm_file->f_mapping; - - if (vma->vm_flags & VM_MAYSHARE) - page = alloc_huge_page_shared(vma, addr); - else - page = alloc_huge_page_private(vma, addr); + set_page_refcounted(page); + set_page_private(page, (unsigned long) mapping); - if (!IS_ERR(page)) { - set_page_refcounted(page); - set_page_private(page, (unsigned long) mapping); - } return page; } @@ -757,6 +814,13 @@ out: return ret; } +static void hugetlb_vm_op_close(struct vm_area_struct *vma) +{ + unsigned long reserve = vma_resv_huge_pages(vma); + if (reserve) + hugetlb_acct_memory(-reserve); +} + /* * We cannot handle pagefaults against hugetlb pages at all. They cause * handle_mm_fault() to try to instantiate regular-sized pages in the @@ -771,6 +835,7 @@ static int hugetlb_vm_op_fault(struct vm struct vm_operations_struct hugetlb_vm_ops = { .fault = hugetlb_vm_op_fault, + .close = hugetlb_vm_op_close, }; static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, @@ -1289,11 +1354,25 @@ static long region_truncate(struct list_ return chg; } -int hugetlb_reserve_pages(struct inode *inode, long from, long to) +int hugetlb_reserve_pages(struct inode *inode, + long from, long to, + struct vm_area_struct *vma) { long ret, chg; - chg = region_chg(&inode->i_mapping->private_list, from, to); + /* + * Shared mappings base their reservation on the number of pages that + * are already allocated on behalf of the file. 
Private mappings need + * to reserve the full area even if read-only as mprotect() may be + * called to make the mapping read-write. Assume !vma is a shm mapping + */ + if (!vma || vma->vm_flags & VM_SHARED) + chg = region_chg(&inode->i_mapping->private_list, from, to); + else { + chg = to - from; + set_vma_resv_huge_pages(vma, chg); + } + if (chg < 0) return chg; @@ -1304,7 +1383,8 @@ int hugetlb_reserve_pages(struct inode * hugetlb_put_quota(inode->i_mapping, chg); return ret; } - region_add(&inode->i_mapping->private_list, from, to); + if (!vma || vma->vm_flags & VM_SHARED) + region_add(&inode->i_mapping->private_list, from, to); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() 2008-05-27 18:51 ` [PATCH 2/3] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() Mel Gorman @ 2008-05-28 13:52 ` Adam Litke 0 siblings, 0 replies; 16+ messages in thread From: Adam Litke @ 2008-05-28 13:52 UTC (permalink / raw) To: Mel Gorman Cc: akpm, abh, dean, linux-kernel, wli, dwg, linux-mm, andi, kenchen, apw On Tue, 2008-05-27 at 19:51 +0100, Mel Gorman wrote: > This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in a > similar manner to the reservations taken for MAP_SHARED mappings. The reserve count is > accounted both globally and on a per-VMA basis for private mappings. This > guarantees that a process that successfully calls mmap() will successfully > fault all pages in the future unless fork() is called. > > The characteristics of private mappings of hugetlbfs files behaviour after > this patch are; > > 1. The process calling mmap() is guaranteed to succeed all future faults until > it forks(). > 2. On fork(), the parent may die due to SIGKILL on writes to the private > mapping if enough pages are not available for the COW. For reasonably > reliable behaviour in the face of a small huge page pool, children of > hugepage-aware processes should not reference the mappings; such as > might occur when fork()ing to exec(). > 3. On fork(), the child VMAs inherit no reserves. Reads on pages already > faulted by the parent will succeed. Successful writes will depend on enough > huge pages being free in the pool. > 4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper > and at fault time otherwise. > > Before this patch, all reads or writes in the child potentially needs page > allocations that can later lead to the death of the parent. This applies > to reads and writes of uninstantiated pages as well as COW. After the > patch it is only a write to an instantiated page that causes problems. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed 2008-05-27 18:50 [PATCH 0/3] Guarantee faults for processes that call mmap(MAP_PRIVATE) on hugetlbfs v4 Mel Gorman 2008-05-27 18:50 ` [PATCH 1/3] Move hugetlb_acct_memory() Mel Gorman 2008-05-27 18:51 ` [PATCH 2/3] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() Mel Gorman @ 2008-05-27 18:51 ` Mel Gorman 2008-05-28 16:00 ` Mel Gorman ` (2 more replies) 2 siblings, 3 replies; 16+ messages in thread From: Mel Gorman @ 2008-05-27 18:51 UTC (permalink / raw) To: akpm Cc: Mel Gorman, dean, linux-kernel, wli, dwg, apw, linux-mm, andi, kenchen, agl, abh After patch 2 in this series, a process that successfully calls mmap() for a MAP_PRIVATE mapping will be guaranteed to successfully fault until a process calls fork(). At that point, the next write fault from the parent could fail due to COW if the child still has a reference. We only reserve pages for the parent but a copy must be made to avoid leaking data from the parent to the child after fork(). Reserves could be taken for both parent and child at fork time to guarantee faults but if the mapping is large it is highly likely we will not have sufficient pages for the reservation, and it is common to fork only to exec() immediatly after. A failure here would be very undesirable. Note that the current behaviour of mainline with MAP_PRIVATE pages is pretty bad. The following situation is allowed to occur today. 1. Process calls mmap(MAP_PRIVATE) 2. Process calls mlock() to fault all pages and makes sure it succeeds 3. Process forks() 4. Process writes to MAP_PRIVATE mapping while child still exists 5. If the COW fails at this point, the process gets SIGKILLed even though it had taken care to ensure the pages existed This patch improves the situation by guaranteeing the reliability of the process that successfully calls mmap(). When the parent performs COW, it will try to satisfy the allocation without using reserves. If that fails the parent will steal the page leaving any children without a page. Faults from the child after that point will result in failure. If the child COW happens first, an attempt will be made to allocate the page without reserves and the child will get SIGKILLed on failure. To summarise the new behaviour: 1. If the original mapper performs COW on a private mapping with multiple references, it will attempt to allocate a hugepage from the pool or the buddy allocator without using the existing reserves. On fail, VMAs mapping the same area are traversed and the page being COW'd is unmapped where found. It will then steal the original page as the last mapper in the normal way. 2. The VMAs the pages were unmapped from are flagged to note that pages with data no longer exist. Future no-page faults on those VMAs will terminate the process as otherwise it would appear that data was corrupted. A warning is printed to the console that this situation occured. 2. If the child performs COW first, it will attempt to satisfy the COW from the pool if there are enough pages or via the buddy allocator if overcommit is allowed and the buddy allocator can satisfy the request. If it fails, the child will be killed. If the pool is large enough, existing applications will not notice that the reserves were a factor. 
Existing applications depending on the no-reserves been set are unlikely to exist as for much of the history of hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that point or failing the mmap(). Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/hugetlbfs/inode.c | 2 include/linux/hugetlb.h | 6 - mm/hugetlb.c | 196 +++++++++++++++++++++++++++++++++++++++---- mm/memory.c | 2 4 files changed, 184 insertions(+), 22 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0020-map_private_reserve/fs/hugetlbfs/inode.c linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/fs/hugetlbfs/inode.c --- linux-2.6.26-rc2-mm1-0020-map_private_reserve/fs/hugetlbfs/inode.c 2008-05-27 10:26:22.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/fs/hugetlbfs/inode.c 2008-05-27 12:59:20.000000000 +0100 @@ -441,7 +441,7 @@ hugetlb_vmtruncate_list(struct prio_tree v_offset = 0; __unmap_hugepage_range(vma, - vma->vm_start + v_offset, vma->vm_end); + vma->vm_start + v_offset, vma->vm_end, NULL); } } diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0020-map_private_reserve/include/linux/hugetlb.h linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/include/linux/hugetlb.h --- linux-2.6.26-rc2-mm1-0020-map_private_reserve/include/linux/hugetlb.h 2008-05-27 10:26:22.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/include/linux/hugetlb.h 2008-05-27 12:59:20.000000000 +0100 @@ -23,8 +23,10 @@ int hugetlb_overcommit_handler(struct ct int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int); -void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long); -void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long); +void unmap_hugepage_range(struct vm_area_struct *, + unsigned long, unsigned long, struct page *); +void __unmap_hugepage_range(struct vm_area_struct *, + unsigned long, unsigned long, struct page *); int hugetlb_prefault(struct address_space *, struct vm_area_struct *); int hugetlb_report_meminfo(char *); int hugetlb_report_node_meminfo(int, char *); diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0020-map_private_reserve/mm/hugetlb.c linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/mm/hugetlb.c --- linux-2.6.26-rc2-mm1-0020-map_private_reserve/mm/hugetlb.c 2008-05-27 10:26:22.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/mm/hugetlb.c 2008-05-27 15:25:09.000000000 +0100 @@ -40,6 +40,9 @@ static int hugetlb_next_nid; */ static DEFINE_SPINLOCK(hugetlb_lock); +#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) +#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) +#define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) /* * These helpers are used to track how many pages are reserved for * faults in a MAP_PRIVATE mapping. 
Only the process that called mmap() @@ -54,17 +57,32 @@ static unsigned long vma_resv_huge_pages { VM_BUG_ON(!is_vm_hugetlb_page(vma)); if (!(vma->vm_flags & VM_SHARED)) - return (unsigned long)vma->vm_private_data; + return (unsigned long)vma->vm_private_data & ~HPAGE_RESV_MASK; return 0; } static void set_vma_resv_huge_pages(struct vm_area_struct *vma, unsigned long reserve) { + unsigned long flags; VM_BUG_ON(!is_vm_hugetlb_page(vma)); VM_BUG_ON(vma->vm_flags & VM_SHARED); - vma->vm_private_data = (void *)reserve; + flags = (unsigned long)vma->vm_private_data & HPAGE_RESV_MASK; + vma->vm_private_data = (void *)(reserve | flags); +} + +static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) +{ + unsigned long reserveflags = (unsigned long)vma->vm_private_data; + VM_BUG_ON(!is_vm_hugetlb_page(vma)); + vma->vm_private_data = (void *)(reserveflags | flags); +} + +static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag) +{ + VM_BUG_ON(!is_vm_hugetlb_page(vma)); + return ((unsigned long)vma->vm_private_data & flag) != 0; } /* Decrement the reserved pages in the hugepage pool by one */ @@ -78,14 +96,18 @@ static void decrement_hugepage_resv_vma( * Only the process that called mmap() has reserves for * private mappings. */ - if (vma_resv_huge_pages(vma)) { + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { + unsigned long flags, reserve; resv_huge_pages--; + flags = (unsigned long)vma->vm_private_data & + HPAGE_RESV_MASK; reserve = (unsigned long)vma->vm_private_data - 1; - vma->vm_private_data = (void *)reserve; + vma->vm_private_data = (void *)(reserve | flags); } } } +/* Reset counters to 0 and clear all HPAGE_RESV_* flags */ void reset_vma_resv_huge_pages(struct vm_area_struct *vma) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); @@ -153,7 +175,7 @@ static struct page *dequeue_huge_page(vo } static struct page *dequeue_huge_page_vma(struct vm_area_struct *vma, - unsigned long address) + unsigned long address, int avoid_reserve) { int nid; struct page *page = NULL; @@ -173,6 +195,10 @@ static struct page *dequeue_huge_page_vm free_huge_pages - resv_huge_pages == 0) return NULL; + /* If reserves cannot be used, ensure enough pages are in the pool */ + if (avoid_reserve && free_huge_pages - resv_huge_pages == 0) + return NULL; + for_each_zone_zonelist_nodemask(zone, z, zonelist, MAX_NR_ZONES - 1, nodemask) { nid = zone_to_nid(zone); @@ -183,7 +209,9 @@ static struct page *dequeue_huge_page_vm list_del(&page->lru); free_huge_pages--; free_huge_pages_node[nid]--; - decrement_hugepage_resv_vma(vma); + + if (!avoid_reserve) + decrement_hugepage_resv_vma(vma); break; } @@ -534,7 +562,7 @@ static void return_unused_surplus_pages( } static struct page *alloc_huge_page(struct vm_area_struct *vma, - unsigned long addr) + unsigned long addr, int avoid_reserve) { struct page *page; struct address_space *mapping = vma->vm_file->f_mapping; @@ -546,14 +574,15 @@ static struct page *alloc_huge_page(stru * will not have accounted against quota. 
Check that the quota can be * made before satisfying the allocation */ - if (!vma_has_private_reserves(vma)) { + if (!(vma->vm_flags & VM_SHARED) && + !is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { chg = 1; if (hugetlb_get_quota(inode->i_mapping, chg)) return ERR_PTR(-ENOSPC); } spin_lock(&hugetlb_lock); - page = dequeue_huge_page_vma(vma, addr); + page = dequeue_huge_page_vma(vma, addr, avoid_reserve); spin_unlock(&hugetlb_lock); if (!page) { @@ -909,7 +938,7 @@ nomem: } void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end) + unsigned long end, struct page *ref_page) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -937,6 +966,27 @@ void __unmap_hugepage_range(struct vm_ar if (huge_pmd_unshare(mm, &address, ptep)) continue; + /* + * If a reference page is supplied, it is because a specific + * page is being unmapped, not a range. Ensure the page we + * are about to unmap is the actual page of interest. + */ + if (ref_page) { + pte = huge_ptep_get(ptep); + if (huge_pte_none(pte)) + continue; + page = pte_page(pte); + if (page != ref_page) + continue; + + /* + * Mark the VMA as having unmapped its page so that + * future faults in this VMA will fail rather than + * looking like data was lost + */ + set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED); + } + pte = huge_ptep_get_and_clear(mm, address, ptep); if (huge_pte_none(pte)) continue; @@ -955,7 +1005,7 @@ void __unmap_hugepage_range(struct vm_ar } void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end) + unsigned long end, struct page *ref_page) { /* * It is undesirable to test vma->vm_file as it should be non-null @@ -967,19 +1017,63 @@ void unmap_hugepage_range(struct vm_area */ if (vma->vm_file) { spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); - __unmap_hugepage_range(vma, start, end); + __unmap_hugepage_range(vma, start, end, ref_page); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); } } +/* + * This is called when the original mapper is failing to COW a MAP_PRIVATE + * mappping it owns the reserve page for. The intention is to unmap the page + * from other VMAs and let the children be SIGKILLed if they are faulting the + * same region. + */ +int unmap_ref_private(struct mm_struct *mm, + struct vm_area_struct *vma, + struct page *page, + unsigned long address) +{ + struct vm_area_struct *iter_vma; + struct address_space *mapping; + pgoff_t pgoff = ((address - vma->vm_start) >> HPAGE_SHIFT) + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + struct prio_tree_iter iter; + + mapping = (struct address_space *)page_private(page); + vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + BUG_ON(vma->vm_start != iter_vma->vm_start); + + /* Do not unmap the current VMA */ + if (iter_vma == vma) + continue; + + /* + * Unmap the page from other VMAs without their own reserves. + * They get marked to be SIGKILLed if they fault in these + * areas. This is because a future no-page fault on this VMA + * could insert a zeroed page instead of the data existing + * from the time of fork. 
This would look like data corruption + */ + if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER)) + unmap_hugepage_range(iter_vma, + address, address + HPAGE_SIZE, + page); + } + + return 1; +} + static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, pte_t *ptep, pte_t pte) + unsigned long address, pte_t *ptep, pte_t pte, + struct page *pagecache_page) { struct page *old_page, *new_page; int avoidcopy; + int outside_reserve = 0; old_page = pte_page(pte); +retry_avoidcopy: /* If no-one else is actually using this page, avoid the copy * and just make the page writable */ avoidcopy = (page_count(old_page) == 1); @@ -988,11 +1082,43 @@ static int hugetlb_cow(struct mm_struct return 0; } + /* + * If the process that created a MAP_PRIVATE mapping is about to + * perform a COW due to a shared page count, attempt to satisfy + * the allocation without using the existing reserves. The pagecache + * page is used to determine if the reserve at this address was + * consumed or not. If reserves were used, a partial faulted mapping + * at the time of fork() could consume its reserves on COW instead + * of the full address range. + */ + if (!(vma->vm_flags & VM_SHARED) && + is_vma_resv_set(vma, HPAGE_RESV_OWNER) && + old_page != pagecache_page) + outside_reserve = 1; + page_cache_get(old_page); - new_page = alloc_huge_page(vma, address); + new_page = alloc_huge_page(vma, address, outside_reserve); if (IS_ERR(new_page)) { page_cache_release(old_page); + + /* + * If a process owning a MAP_PRIVATE mapping fails to COW, + * it is due to references held by a child and an insufficient + * huge page pool. To guarantee the original mappers + * reliability, unmap the page from child processes. The child + * may get SIGKILLed if it later faults. + */ + if (outside_reserve) { + BUG_ON(huge_pte_none(pte)); + if (unmap_ref_private(mm, vma, old_page, address)) { + BUG_ON(page_count(old_page) != 1); + BUG_ON(huge_pte_none(pte)); + goto retry_avoidcopy; + } + WARN_ON_ONCE(1); + } + return -PTR_ERR(new_page); } @@ -1015,6 +1141,20 @@ static int hugetlb_cow(struct mm_struct return 0; } +/* Return the pagecache page at a given address within a VMA */ +static struct page *hugetlbfs_pagecache_page(struct vm_area_struct *vma, + unsigned long address) +{ + struct address_space *mapping; + unsigned long idx; + + mapping = vma->vm_file->f_mapping; + idx = ((address - vma->vm_start) >> HPAGE_SHIFT) + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + + return find_lock_page(mapping, idx); +} + static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *ptep, int write_access) { @@ -1025,6 +1165,18 @@ static int hugetlb_no_page(struct mm_str struct address_space *mapping; pte_t new_pte; + /* + * Currently, we are forced to kill the process in the event the + * original mapper has unmapped pages from the child due to a failed + * COW. 
Warn that such a situation has occured as it may not be obvious + */ + if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) { + printk(KERN_WARNING + "PID %d killed due to inadequate hugepage pool\n", + current->pid); + return ret; + } + mapping = vma->vm_file->f_mapping; idx = ((address - vma->vm_start) >> HPAGE_SHIFT) + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); @@ -1039,7 +1191,7 @@ retry: size = i_size_read(mapping->host) >> HPAGE_SHIFT; if (idx >= size) goto out; - page = alloc_huge_page(vma, address); + page = alloc_huge_page(vma, address, 0); if (IS_ERR(page)) { ret = -PTR_ERR(page); goto out; @@ -1081,7 +1233,7 @@ retry: if (write_access && !(vma->vm_flags & VM_SHARED)) { /* Optimization, do the COW without a second fault */ - ret = hugetlb_cow(mm, vma, address, ptep, new_pte); + ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page); } spin_unlock(&mm->page_table_lock); @@ -1126,8 +1278,15 @@ int hugetlb_fault(struct mm_struct *mm, spin_lock(&mm->page_table_lock); /* Check for a racing update before calling hugetlb_cow */ if (likely(pte_same(entry, huge_ptep_get(ptep)))) - if (write_access && !pte_write(entry)) - ret = hugetlb_cow(mm, vma, address, ptep, entry); + if (write_access && !pte_write(entry)) { + struct page *page; + page = hugetlbfs_pagecache_page(vma, address); + ret = hugetlb_cow(mm, vma, address, ptep, entry, page); + if (page) { + unlock_page(page); + put_page(page); + } + } spin_unlock(&mm->page_table_lock); mutex_unlock(&hugetlb_instantiation_mutex); @@ -1371,6 +1530,7 @@ int hugetlb_reserve_pages(struct inode * else { chg = to - from; set_vma_resv_huge_pages(vma, chg); + set_vma_resv_flags(vma, HPAGE_RESV_OWNER); } if (chg < 0) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0020-map_private_reserve/mm/memory.c linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/mm/memory.c --- linux-2.6.26-rc2-mm1-0020-map_private_reserve/mm/memory.c 2008-05-27 10:25:54.000000000 +0100 +++ linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/mm/memory.c 2008-05-27 12:59:20.000000000 +0100 @@ -883,7 +883,7 @@ unsigned long unmap_vmas(struct mmu_gath } if (unlikely(is_vm_hugetlb_page(vma))) { - unmap_hugepage_range(vma, start, end); + unmap_hugepage_range(vma, start, end, NULL); zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE); start = end; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
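
Patch 3 packs both the private reserve count and the HPAGE_RESV_OWNER and
HPAGE_RESV_UNMAPPED flags into the single vm_private_data word of the VMA.
The stand-alone userspace model below mirrors those helpers so the packing
can be sanity-checked in isolation; the 'owner' and 'child' variables are
hypothetical stand-ins for real VMAs, and the example is not part of the
series (the later cleanup patch in this thread wraps the same casts in
get/set_vma_private_data()).

/*
 * Stand-alone model of the vm_private_data packing used by patch 3.
 * The two unsigned longs below stand in for the vm_private_data field
 * of an owner VMA and a child VMA; they are illustrative, not kernel code.
 */
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define HPAGE_RESV_OWNER	(1UL << (BITS_PER_LONG - 1))
#define HPAGE_RESV_UNMAPPED	(1UL << (BITS_PER_LONG - 2))
#define HPAGE_RESV_MASK		(HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

/* Reserve count lives in the low bits, flags in the two top bits */
static unsigned long resv_huge_pages(unsigned long priv)
{
	return priv & ~HPAGE_RESV_MASK;
}

static void set_resv_huge_pages(unsigned long *priv, unsigned long reserve)
{
	*priv = (*priv & HPAGE_RESV_MASK) | reserve;
}

static void set_resv_flags(unsigned long *priv, unsigned long flags)
{
	*priv |= flags;
}

static int is_resv_set(unsigned long priv, unsigned long flag)
{
	return (priv & flag) != 0;
}

int main(void)
{
	unsigned long owner = 0, child = 0;

	/* mmap() of a 128-hugepage MAP_PRIVATE mapping by the owner */
	set_resv_huge_pages(&owner, 128);
	set_resv_flags(&owner, HPAGE_RESV_OWNER);

	/* fork(): the child VMA inherits no reserves and no flags */
	assert(resv_huge_pages(child) == 0);
	assert(!is_resv_set(child, HPAGE_RESV_OWNER));

	/* The owner faults one page, consuming one reserved page */
	set_resv_huge_pages(&owner, resv_huge_pages(owner) - 1);

	/* A failed COW by the owner unmaps the page from the child and
	 * flags the child VMA so that later faults there fail cleanly */
	set_resv_flags(&child, HPAGE_RESV_UNMAPPED);

	printf("owner reserves=%lu owner flag=%d child unmapped=%d\n",
	       resv_huge_pages(owner),
	       is_resv_set(owner, HPAGE_RESV_OWNER),
	       is_resv_set(child, HPAGE_RESV_UNMAPPED));
	return 0;
}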
* Re: [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed 2008-05-27 18:51 ` [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed Mel Gorman @ 2008-05-28 16:00 ` Mel Gorman 2008-05-28 18:15 ` Adam Litke 2008-05-28 18:16 ` Adam Litke 2008-05-29 1:42 ` Andrew Morton 2 siblings, 1 reply; 16+ messages in thread From: Mel Gorman @ 2008-05-28 16:00 UTC (permalink / raw) To: akpm Cc: dean, linux-kernel, wli, dwg, apw, linux-mm, andi, kenchen, agl, abh, hannes [PATCH 4/3] Fix prio tree lookup I spoke too soon. This is a fix to patch 3/3. If a child unmaps the start of the VMA, the start address is different and that is perfectly legimite making the BUG_ON check bogus and should be removed. While page cache lookups are in HPAGE_SIZE, the vma->vm_pgoff is in PAGE_SIZE units, not HPAGE_SIZE. The offset calculation needs to be in PAGE_SIZE units to find other VMAs that are mapping the same range of pages. This patch fixes the offset calculation and adds an explanation comment as to why it is different from a page cache lookup. Credit goes to Johannes Weiner for spotting the bogus BUG_ON on IRC which led to the discovery of the faulty offset calculation. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/hugetlb.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/mm/hugetlb.c linux-2.6.26-rc2-mm1-1010_fix_priotree_lookup/mm/hugetlb.c --- linux-2.6.26-rc2-mm1-0030-reliable_parent_faults/mm/hugetlb.c 2008-05-28 14:57:51.000000000 +0100 +++ linux-2.6.26-rc2-mm1-1010_fix_priotree_lookup/mm/hugetlb.c 2008-05-28 15:05:32.000000000 +0100 @@ -1035,14 +1035,18 @@ int unmap_ref_private(struct mm_struct * { struct vm_area_struct *iter_vma; struct address_space *mapping; - pgoff_t pgoff = ((address - vma->vm_start) >> HPAGE_SHIFT) - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); struct prio_tree_iter iter; + pgoff_t pgoff; + /* + * vm_pgoff is in PAGE_SIZE units, hence the different calculation + * from page cache lookup which is in HPAGE_SIZE units. + */ + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + + (vma->vm_pgoff >> PAGE_SHIFT); mapping = (struct address_space *)page_private(page); - vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) { - BUG_ON(vma->vm_start != iter_vma->vm_start); + vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) { /* Do not unmap the current VMA */ if (iter_vma == vma) continue; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
From: Adam Litke @ 2008-05-28 18:15 UTC (permalink / raw)
To: Mel Gorman
Cc: akpm, dean, linux-kernel, wli, dwg, apw, linux-mm, andi, kenchen, abh, hannes
On Wed, 2008-05-28 at 17:00 +0100, Mel Gorman wrote:
> [PATCH 4/3] Fix prio tree lookup
>
> I spoke too soon. This is a fix to patch 3/3.
>
> If a child unmaps the start of the VMA, the start address is different and
> that is perfectly legimite making the BUG_ON check bogus and should be removed.
> While page cache lookups are in HPAGE_SIZE, the vma->vm_pgoff is in PAGE_SIZE
> units, not HPAGE_SIZE. The offset calculation needs to be in PAGE_SIZE units
> to find other VMAs that are mapping the same range of pages. This patch
> fixes the offset calculation and adds an explanation comment as to why it
> is different from a page cache lookup.
>
> Credit goes to Johannes Weiner for spotting the bogus BUG_ON on IRC which
> led to the discovery of the faulty offset calculation.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed 2008-05-27 18:51 ` [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed Mel Gorman 2008-05-28 16:00 ` Mel Gorman @ 2008-05-28 18:16 ` Adam Litke 2008-05-29 1:42 ` Andrew Morton 2 siblings, 0 replies; 16+ messages in thread From: Adam Litke @ 2008-05-28 18:16 UTC (permalink / raw) To: Mel Gorman Cc: akpm, dean, linux-kernel, wli, dwg, apw, linux-mm, andi, kenchen, abh On Tue, 2008-05-27 at 19:51 +0100, Mel Gorman wrote: > After patch 2 in this series, a process that successfully calls mmap() > for a MAP_PRIVATE mapping will be guaranteed to successfully fault until a > process calls fork(). At that point, the next write fault from the parent > could fail due to COW if the child still has a reference. > > We only reserve pages for the parent but a copy must be made to avoid leaking > data from the parent to the child after fork(). Reserves could be taken for > both parent and child at fork time to guarantee faults but if the mapping > is large it is highly likely we will not have sufficient pages for the > reservation, and it is common to fork only to exec() immediatly after. A > failure here would be very undesirable. > > Note that the current behaviour of mainline with MAP_PRIVATE pages is > pretty bad. The following situation is allowed to occur today. > > 1. Process calls mmap(MAP_PRIVATE) > 2. Process calls mlock() to fault all pages and makes sure it succeeds > 3. Process forks() > 4. Process writes to MAP_PRIVATE mapping while child still exists > 5. If the COW fails at this point, the process gets SIGKILLed even though it > had taken care to ensure the pages existed > > This patch improves the situation by guaranteeing the reliability of the > process that successfully calls mmap(). When the parent performs COW, it > will try to satisfy the allocation without using reserves. If that fails the > parent will steal the page leaving any children without a page. Faults from > the child after that point will result in failure. If the child COW happens > first, an attempt will be made to allocate the page without reserves and > the child will get SIGKILLed on failure. > > To summarise the new behaviour: > > 1. If the original mapper performs COW on a private mapping with multiple > references, it will attempt to allocate a hugepage from the pool or > the buddy allocator without using the existing reserves. On fail, VMAs > mapping the same area are traversed and the page being COW'd is unmapped > where found. It will then steal the original page as the last mapper in > the normal way. > > 2. The VMAs the pages were unmapped from are flagged to note that pages > with data no longer exist. Future no-page faults on those VMAs will > terminate the process as otherwise it would appear that data was corrupted. > A warning is printed to the console that this situation occured. > > 2. If the child performs COW first, it will attempt to satisfy the COW > from the pool if there are enough pages or via the buddy allocator if > overcommit is allowed and the buddy allocator can satisfy the request. If > it fails, the child will be killed. > > If the pool is large enough, existing applications will not notice that the > reserves were a factor. 
Existing applications depending on the no-reserves > been set are unlikely to exist as for much of the history of hugetlbfs, > pages were prefaulted at mmap(), allocating the pages at that point or failing > the mmap(). > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed 2008-05-27 18:51 ` [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed Mel Gorman 2008-05-28 16:00 ` Mel Gorman 2008-05-28 18:16 ` Adam Litke @ 2008-05-29 1:42 ` Andrew Morton 2008-05-30 16:57 ` [PATCH 0/2] hugetlb reservations v4/MAP_NORESERVE V3 cleanups Andy Whitcroft 2 siblings, 1 reply; 16+ messages in thread From: Andrew Morton @ 2008-05-29 1:42 UTC (permalink / raw) To: Mel Gorman Cc: dean, linux-kernel, wli, dwg, apw, linux-mm, andi, kenchen, agl, abh On Tue, 27 May 2008 19:51:28 +0100 (IST) Mel Gorman <mel@csn.ul.ie> wrote: > > After patch 2 in this series, a process that successfully calls mmap() > for a MAP_PRIVATE mapping will be guaranteed to successfully fault until a > process calls fork(). At that point, the next write fault from the parent > could fail due to COW if the child still has a reference. > > We only reserve pages for the parent but a copy must be made to avoid leaking > data from the parent to the child after fork(). Reserves could be taken for > both parent and child at fork time to guarantee faults but if the mapping > is large it is highly likely we will not have sufficient pages for the > reservation, and it is common to fork only to exec() immediatly after. A > failure here would be very undesirable. > > Note that the current behaviour of mainline with MAP_PRIVATE pages is > pretty bad. The following situation is allowed to occur today. > > 1. Process calls mmap(MAP_PRIVATE) > 2. Process calls mlock() to fault all pages and makes sure it succeeds > 3. Process forks() > 4. Process writes to MAP_PRIVATE mapping while child still exists > 5. If the COW fails at this point, the process gets SIGKILLed even though it > had taken care to ensure the pages existed > > This patch improves the situation by guaranteeing the reliability of the > process that successfully calls mmap(). When the parent performs COW, it > will try to satisfy the allocation without using reserves. If that fails the > parent will steal the page leaving any children without a page. Faults from > the child after that point will result in failure. If the child COW happens > first, an attempt will be made to allocate the page without reserves and > the child will get SIGKILLed on failure. > > To summarise the new behaviour: > > 1. If the original mapper performs COW on a private mapping with multiple > references, it will attempt to allocate a hugepage from the pool or > the buddy allocator without using the existing reserves. On fail, VMAs > mapping the same area are traversed and the page being COW'd is unmapped > where found. It will then steal the original page as the last mapper in > the normal way. > > 2. The VMAs the pages were unmapped from are flagged to note that pages > with data no longer exist. Future no-page faults on those VMAs will > terminate the process as otherwise it would appear that data was corrupted. > A warning is printed to the console that this situation occured. > > 2. If the child performs COW first, it will attempt to satisfy the COW > from the pool if there are enough pages or via the buddy allocator if > overcommit is allowed and the buddy allocator can satisfy the request. If > it fails, the child will be killed. > > If the pool is large enough, existing applications will not notice that the > reserves were a factor. 
Existing applications depending on the no-reserves > been set are unlikely to exist as for much of the history of hugetlbfs, > pages were prefaulted at mmap(), allocating the pages at that point or failing > the mmap(). > Implementation nitlets: > +#define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) > +#define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) > +#define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) > /* > * These helpers are used to track how many pages are reserved for > * faults in a MAP_PRIVATE mapping. Only the process that called mmap() > @@ -54,17 +57,32 @@ static unsigned long vma_resv_huge_pages > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > if (!(vma->vm_flags & VM_SHARED)) > - return (unsigned long)vma->vm_private_data; > + return (unsigned long)vma->vm_private_data & ~HPAGE_RESV_MASK; > return 0; > } It might be better to have unsigned long get_vma_private_data(struct vm_area_struct); unsigned long set_vma_private_data(struct vm_area_struct); (or even get_private & set_private) to do away with all the funky casting. Or it might not be, too. > + pgoff_t pgoff = ((address - vma->vm_start) >> HPAGE_SHIFT) > + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); > > ... > > + unsigned long idx; > + > + mapping = vma->vm_file->f_mapping; > + idx = ((address - vma->vm_start) >> HPAGE_SHIFT) > + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); It would be clearer to have a little helper function for the above two snippets. And the first uses pgoff_t whereas the second uses bare ulong. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 0/2] hugetlb reservations v4/MAP_NORESERVE V3 cleanups
From: Andy Whitcroft @ 2008-05-30 16:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, agl, wli, kenchen, dwg, andi, Mel Gorman, dean, abh, Andy Whitcroft
Following on from Andrew's feedback here are a couple of update patches
for hugetlb reservations v4 and MAP_NORESERVE V3. These introduce a number
of new helpers and use those throughout.
This stack consists of two patches to allow them to be merged into the
appropriate stacks.
huge-page-private-reservation-review-cleanups -- adds the helpers and
utilises them in the base reservations stack, and
huge-page-MAP_NORESERVE-review-cleanups -- uses the same helpers in the
MAP_NORESERVE stack.
-apw
* [PATCH 1/2] huge page private reservation review cleanups 2008-05-30 16:57 ` [PATCH 0/2] hugetlb reservations v4/MAP_NORESERVE V3 cleanups Andy Whitcroft @ 2008-05-30 16:58 ` Andy Whitcroft 2008-05-30 20:29 ` Andrew Morton 2008-05-31 12:21 ` Mel Gorman 2008-05-30 16:58 ` [PATCH 2/2] huge page MAP_NORESERVE " Andy Whitcroft 1 sibling, 2 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-05-30 16:58 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, agl, wli, kenchen, dwg, andi, Mel Gorman, dean, abh, Andy Whitcroft Create some new accessors for vma private data to cut down on and contain the casts. Encapsulates the huge and small page offset calculations. Also adds a couple of VM_BUG_ONs for consistency. Signed-off-by: Andy Whitcroft <apw@shadowen.org> --- mm/hugetlb.c | 56 +++++++++++++++++++++++++++++++++++++++++++------------- 1 files changed, 43 insertions(+), 13 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 729a830..7a5ac81 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -40,6 +40,26 @@ static int hugetlb_next_nid; */ static DEFINE_SPINLOCK(hugetlb_lock); +/* + * Convert the address within this vma to the page offset within + * the mapping, in base page units. + */ +pgoff_t vma_page_offset(struct vm_area_struct *vma, unsigned long address) +{ + return ((address - vma->vm_start) >> PAGE_SHIFT) + + (vma->vm_pgoff >> PAGE_SHIFT); +} + +/* + * Convert the address within this vma to the page offset within + * the mapping, in pagecache page units; huge pages here. + */ +pgoff_t vma_pagecache_offset(struct vm_area_struct *vma, unsigned long address) +{ + return ((address - vma->vm_start) >> HPAGE_SHIFT) + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); +} + #define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) #define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) @@ -53,36 +73,48 @@ static DEFINE_SPINLOCK(hugetlb_lock); * to reset the VMA at fork() time as it is not in use yet and there is no * chance of the global counters getting corrupted as a result of the values. 
*/ +static unsigned long get_vma_private_data(struct vm_area_struct *vma) +{ + return (unsigned long)vma->vm_private_data; +} + +static void set_vma_private_data(struct vm_area_struct *vma, + unsigned long value) +{ + vma->vm_private_data = (void *)value; +} + static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); if (!(vma->vm_flags & VM_SHARED)) - return (unsigned long)vma->vm_private_data & ~HPAGE_RESV_MASK; + return get_vma_private_data(vma) & ~HPAGE_RESV_MASK; return 0; } static void set_vma_resv_huge_pages(struct vm_area_struct *vma, unsigned long reserve) { - unsigned long flags; VM_BUG_ON(!is_vm_hugetlb_page(vma)); VM_BUG_ON(vma->vm_flags & VM_SHARED); - flags = (unsigned long)vma->vm_private_data & HPAGE_RESV_MASK; - vma->vm_private_data = (void *)(reserve | flags); + set_vma_private_data(vma, + (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve); } static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) { - unsigned long reserveflags = (unsigned long)vma->vm_private_data; VM_BUG_ON(!is_vm_hugetlb_page(vma)); - vma->vm_private_data = (void *)(reserveflags | flags); + VM_BUG_ON(vma->vm_flags & VM_SHARED); + + set_vma_private_data(vma, get_vma_private_data(vma) | flags); } static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag) { VM_BUG_ON(!is_vm_hugetlb_page(vma)); - return ((unsigned long)vma->vm_private_data & flag) != 0; + + return (get_vma_private_data(vma) & flag) != 0; } /* Decrement the reserved pages in the hugepage pool by one */ @@ -1150,11 +1182,10 @@ static struct page *hugetlbfs_pagecache_page(struct vm_area_struct *vma, unsigned long address) { struct address_space *mapping; - unsigned long idx; + pgoff_t idx; mapping = vma->vm_file->f_mapping; - idx = ((address - vma->vm_start) >> HPAGE_SHIFT) - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + idx = vma_pagecache_offset(vma, address); return find_lock_page(mapping, idx); } @@ -1163,7 +1194,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *ptep, int write_access) { int ret = VM_FAULT_SIGBUS; - unsigned long idx; + pgoff_t idx; unsigned long size; struct page *page; struct address_space *mapping; @@ -1182,8 +1213,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, } mapping = vma->vm_file->f_mapping; - idx = ((address - vma->vm_start) >> HPAGE_SHIFT) - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + idx = vma_pagecache_offset(vma, address); /* * Use page lock to guard against racing truncation -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
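To make the arithmetic in the new vma_pagecache_offset() helper concrete, here is a minimal user-space sketch of the same calculation. The PAGE_SHIFT/HPAGE_SHIFT values (4 KB base pages, 2 MB huge pages), the example addresses and main() are assumptions chosen for illustration rather than anything taken from the patch.

#include <stdio.h>

#define PAGE_SHIFT	12UL	/* 4 KB base pages -- illustrative value */
#define HPAGE_SHIFT	21UL	/* 2 MB huge pages -- illustrative value */

/* Same arithmetic as vma_pagecache_offset() in the patch above. */
static unsigned long pagecache_offset(unsigned long vm_start,
				      unsigned long vm_pgoff,	/* in base pages */
				      unsigned long address)
{
	return ((address - vm_start) >> HPAGE_SHIFT) +
		(vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
}

int main(void)
{
	unsigned long vm_start = 0x40000000UL;	/* hypothetical 2 MB-aligned base */
	unsigned long vm_pgoff = 512;		/* file offset of one huge page */
	unsigned long address  = vm_start + 3 * (1UL << HPAGE_SHIFT);

	printf("pagecache index = %lu\n",
	       pagecache_offset(vm_start, vm_pgoff, address));	/* prints 4 */
	return 0;
}

With vm_pgoff covering one huge page of file offset (512 base pages) and a fault landing three huge pages into the mapping, the computed pagecache index is 4 -- the units are huge pages, matching the hugetlbfs pagecache.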
* Re: [PATCH 1/2] huge page private reservation review cleanups 2008-05-30 16:58 ` [PATCH 1/2] huge page private reservation review cleanups Andy Whitcroft @ 2008-05-30 20:29 ` Andrew Morton 2008-05-31 13:06 ` Mel Gorman 2008-05-31 12:21 ` Mel Gorman 1 sibling, 1 reply; 16+ messages in thread From: Andrew Morton @ 2008-05-30 20:29 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-mm, linux-kernel, agl, wli, kenchen, dwg, andi, mel, dean, abh On Fri, 30 May 2008 17:58:24 +0100 Andy Whitcroft <apw@shadowen.org> wrote: > > Create some new accessors for vma private data to cut down on and contain > the casts. Encapsulates the huge and small page offset calculations. Also > adds a couple of VM_BUG_ONs for consistency. > I'll stage this after Mel's hugetlb-guarantee-that-cow-faults-for-a-process-that-called-mmapmap_private-on-hugetlbfs-will-succeed.patch > --- > mm/hugetlb.c | 56 +++++++++++++++++++++++++++++++++++++++++++------------- > 1 files changed, 43 insertions(+), 13 deletions(-) > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 729a830..7a5ac81 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -40,6 +40,26 @@ static int hugetlb_next_nid; > */ > static DEFINE_SPINLOCK(hugetlb_lock); > > +/* > + * Convert the address within this vma to the page offset within > + * the mapping, in base page units. > + */ > +pgoff_t vma_page_offset(struct vm_area_struct *vma, unsigned long address) > +{ > + return ((address - vma->vm_start) >> PAGE_SHIFT) + > + (vma->vm_pgoff >> PAGE_SHIFT); > +} > + > +/* > + * Convert the address within this vma to the page offset within > + * the mapping, in pagecache page units; huge pages here. > + */ > +pgoff_t vma_pagecache_offset(struct vm_area_struct *vma, unsigned long address) > +{ > + return ((address - vma->vm_start) >> HPAGE_SHIFT) + > + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); > +} I'll make these static. > #define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) > #define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) > #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) > @@ -53,36 +73,48 @@ static DEFINE_SPINLOCK(hugetlb_lock); > * to reset the VMA at fork() time as it is not in use yet and there is no > * chance of the global counters getting corrupted as a result of the values. > */ > +static unsigned long get_vma_private_data(struct vm_area_struct *vma) > +{ > + return (unsigned long)vma->vm_private_data; > +} > + > +static void set_vma_private_data(struct vm_area_struct *vma, > + unsigned long value) > +{ > + vma->vm_private_data = (void *)value; > +} Better. > static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > if (!(vma->vm_flags & VM_SHARED)) > - return (unsigned long)vma->vm_private_data & ~HPAGE_RESV_MASK; > + return get_vma_private_data(vma) & ~HPAGE_RESV_MASK; > return 0; > } But I wonder if helpers which manipulate a vma's HPAGE_RESV_MASK flag(s) rather than the whole vm_provate_data would have been better. 
> static void set_vma_resv_huge_pages(struct vm_area_struct *vma, > unsigned long reserve) > { > - unsigned long flags; > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > VM_BUG_ON(vma->vm_flags & VM_SHARED); > > - flags = (unsigned long)vma->vm_private_data & HPAGE_RESV_MASK; > - vma->vm_private_data = (void *)(reserve | flags); > + set_vma_private_data(vma, > + (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve); > } > > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > { > - unsigned long reserveflags = (unsigned long)vma->vm_private_data; > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - vma->vm_private_data = (void *)(reserveflags | flags); > + VM_BUG_ON(vma->vm_flags & VM_SHARED); > + > + set_vma_private_data(vma, get_vma_private_data(vma) | flags); > } > > static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - return ((unsigned long)vma->vm_private_data & flag) != 0; > + > + return (get_vma_private_data(vma) & flag) != 0; > } Oh. We already kinda have it. Perhaps vma_resv_huge_pages() should have called set_vma_resv_flags(). I guess the assertions would have busted that. Oh well, whatever. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] huge page private reservation review cleanups 2008-05-30 20:29 ` Andrew Morton @ 2008-05-31 13:06 ` Mel Gorman 0 siblings, 0 replies; 16+ messages in thread From: Mel Gorman @ 2008-05-31 13:06 UTC (permalink / raw) To: Andrew Morton Cc: Andy Whitcroft, linux-mm, linux-kernel, agl, wli, kenchen, dwg, andi, dean, abh On (30/05/08 13:29), Andrew Morton didst pronounce: > On Fri, 30 May 2008 17:58:24 +0100 > Andy Whitcroft <apw@shadowen.org> wrote: > > > > > Create some new accessors for vma private data to cut down on and contain > > the casts. Encapsulates the huge and small page offset calculations. Also > > adds a couple of VM_BUG_ONs for consistency. > > > > I'll stage this after Mel's > hugetlb-guarantee-that-cow-faults-for-a-process-that-called-mmapmap_private-on-hugetlbfs-will-succeed.patch > Sounds good, thanks. > > --- > > mm/hugetlb.c | 56 +++++++++++++++++++++++++++++++++++++++++++------------- > > 1 files changed, 43 insertions(+), 13 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 729a830..7a5ac81 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -40,6 +40,26 @@ static int hugetlb_next_nid; > > */ > > static DEFINE_SPINLOCK(hugetlb_lock); > > > > +/* > > + * Convert the address within this vma to the page offset within > > + * the mapping, in base page units. > > + */ > > +pgoff_t vma_page_offset(struct vm_area_struct *vma, unsigned long address) > > +{ > > + return ((address - vma->vm_start) >> PAGE_SHIFT) + > > + (vma->vm_pgoff >> PAGE_SHIFT); > > +} > > + > > +/* > > + * Convert the address within this vma to the page offset within > > + * the mapping, in pagecache page units; huge pages here. > > + */ > > +pgoff_t vma_pagecache_offset(struct vm_area_struct *vma, unsigned long address) > > +{ > > + return ((address - vma->vm_start) >> HPAGE_SHIFT) + > > + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); > > +} > > I'll make these static. > > > #define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) > > #define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) > > #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) > > @@ -53,36 +73,48 @@ static DEFINE_SPINLOCK(hugetlb_lock); > > * to reset the VMA at fork() time as it is not in use yet and there is no > > * chance of the global counters getting corrupted as a result of the values. > > */ > > +static unsigned long get_vma_private_data(struct vm_area_struct *vma) > > +{ > > + return (unsigned long)vma->vm_private_data; > > +} > > + > > +static void set_vma_private_data(struct vm_area_struct *vma, > > + unsigned long value) > > +{ > > + vma->vm_private_data = (void *)value; > > +} > > Better. > > > static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma) > > { > > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > > if (!(vma->vm_flags & VM_SHARED)) > > - return (unsigned long)vma->vm_private_data & ~HPAGE_RESV_MASK; > > + return get_vma_private_data(vma) & ~HPAGE_RESV_MASK; > > return 0; > > } > > But I wonder if helpers which manipulate a vma's HPAGE_RESV_MASK > flag(s) rather than the whole vm_provate_data would have been better. > There are helpers that do that below. It was suggested that I define a struct with bit-fields instead but I didn't feel it was much easier to understand than masks which are already pretty common. 
> > static void set_vma_resv_huge_pages(struct vm_area_struct *vma, > > unsigned long reserve) > > { > > - unsigned long flags; > > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > > VM_BUG_ON(vma->vm_flags & VM_SHARED); > > > > - flags = (unsigned long)vma->vm_private_data & HPAGE_RESV_MASK; > > - vma->vm_private_data = (void *)(reserve | flags); > > + set_vma_private_data(vma, > > + (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve); > > } > > > > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > > { > > - unsigned long reserveflags = (unsigned long)vma->vm_private_data; > > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > > - vma->vm_private_data = (void *)(reserveflags | flags); > > + VM_BUG_ON(vma->vm_flags & VM_SHARED); > > + > > + set_vma_private_data(vma, get_vma_private_data(vma) | flags); > > } > > > > static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag) > > { > > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > > - return ((unsigned long)vma->vm_private_data & flag) != 0; > > + > > + return (get_vma_private_data(vma) & flag) != 0; > > } > > Oh. We already kinda have it. Perhaps vma_resv_huge_pages() should > have called set_vma_resv_flags(). I guess the assertions would have > busted that. > The assertions as-is would have made that hard all right, but the checks (particularly the SHARED ones) that are there are really defensive in nature rather than set in stone. A VM_SHARED mapping could use the flags as well if there was a good reason for it but I didn't want the helpers to be used by accident. > Oh well, whatever. > I am currently under the belief that the helpers as-is are fairly easy to understand, should not interfere badly with the 1GB-and-multi-large-page-support being worked on and are reasonably difficult to use incorrectly but I'm open to being corrected on it. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
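For what it's worth, the bit-field alternative Mel mentions rejecting above might have looked something like the hypothetical sketch below; nothing of the sort was posted, and the field names and widths are assumptions.

/*
 * Hypothetical bit-field view of the per-VMA reservation word -- the
 * alternative Mel mentions and rejects. The 62-bit field presumes a
 * 64-bit long plus the common compiler extension allowing long
 * bit-fields, and bit-field layout within the word is
 * implementation-defined, which is one reason the posted code sticks
 * with explicit masks instead.
 */
struct hugetlb_vma_resv {
	unsigned long reserve  : 62;	/* reserved huge pages remaining */
	unsigned long unmapped : 1;	/* would replace HPAGE_RESV_UNMAPPED */
	unsigned long owner    : 1;	/* would replace HPAGE_RESV_OWNER */
};

int main(void)
{
	/* sanity check: the struct should still fit in one word */
	return sizeof(struct hugetlb_vma_resv) == sizeof(unsigned long) ? 0 : 1;
}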
* Re: [PATCH 1/2] huge page private reservation review cleanups 2008-05-30 16:58 ` [PATCH 1/2] huge page private reservation review cleanups Andy Whitcroft 2008-05-30 20:29 ` Andrew Morton @ 2008-05-31 12:21 ` Mel Gorman 1 sibling, 0 replies; 16+ messages in thread From: Mel Gorman @ 2008-05-31 12:21 UTC (permalink / raw) To: Andy Whitcroft Cc: Andrew Morton, linux-mm, linux-kernel, agl, wli, kenchen, dwg, andi, dean, abh On (30/05/08 17:58), Andy Whitcroft didst pronounce: > > Create some new accessors for vma private data to cut down on and contain > the casts. Encapsulates the huge and small page offset calculations. Also > adds a couple of VM_BUG_ONs for consistency. > Things are a bit more readable with the helpers for sure. Thanks for catching the missing VM_BUG_ONs as well. > Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> > --- > mm/hugetlb.c | 56 +++++++++++++++++++++++++++++++++++++++++++------------- > 1 files changed, 43 insertions(+), 13 deletions(-) > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 729a830..7a5ac81 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -40,6 +40,26 @@ static int hugetlb_next_nid; > */ > static DEFINE_SPINLOCK(hugetlb_lock); > > +/* > + * Convert the address within this vma to the page offset within > + * the mapping, in base page units. > + */ > +pgoff_t vma_page_offset(struct vm_area_struct *vma, unsigned long address) > +{ > + return ((address - vma->vm_start) >> PAGE_SHIFT) + > + (vma->vm_pgoff >> PAGE_SHIFT); > +} > + > +/* > + * Convert the address within this vma to the page offset within > + * the mapping, in pagecache page units; huge pages here. > + */ > +pgoff_t vma_pagecache_offset(struct vm_area_struct *vma, unsigned long address) > +{ > + return ((address - vma->vm_start) >> HPAGE_SHIFT) + > + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); > +} > + > #define HPAGE_RESV_OWNER (1UL << (BITS_PER_LONG - 1)) > #define HPAGE_RESV_UNMAPPED (1UL << (BITS_PER_LONG - 2)) > #define HPAGE_RESV_MASK (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED) > @@ -53,36 +73,48 @@ static DEFINE_SPINLOCK(hugetlb_lock); > * to reset the VMA at fork() time as it is not in use yet and there is no > * chance of the global counters getting corrupted as a result of the values. 
> */ > +static unsigned long get_vma_private_data(struct vm_area_struct *vma) > +{ > + return (unsigned long)vma->vm_private_data; > +} > + > +static void set_vma_private_data(struct vm_area_struct *vma, > + unsigned long value) > +{ > + vma->vm_private_data = (void *)value; > +} > + > static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > if (!(vma->vm_flags & VM_SHARED)) > - return (unsigned long)vma->vm_private_data & ~HPAGE_RESV_MASK; > + return get_vma_private_data(vma) & ~HPAGE_RESV_MASK; > return 0; > } > > static void set_vma_resv_huge_pages(struct vm_area_struct *vma, > unsigned long reserve) > { > - unsigned long flags; > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > VM_BUG_ON(vma->vm_flags & VM_SHARED); > > - flags = (unsigned long)vma->vm_private_data & HPAGE_RESV_MASK; > - vma->vm_private_data = (void *)(reserve | flags); > + set_vma_private_data(vma, > + (get_vma_private_data(vma) & HPAGE_RESV_MASK) | reserve); > } > > static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) > { > - unsigned long reserveflags = (unsigned long)vma->vm_private_data; > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - vma->vm_private_data = (void *)(reserveflags | flags); > + VM_BUG_ON(vma->vm_flags & VM_SHARED); > + > + set_vma_private_data(vma, get_vma_private_data(vma) | flags); > } > > static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag) > { > VM_BUG_ON(!is_vm_hugetlb_page(vma)); > - return ((unsigned long)vma->vm_private_data & flag) != 0; > + > + return (get_vma_private_data(vma) & flag) != 0; > } > > /* Decrement the reserved pages in the hugepage pool by one */ > @@ -1150,11 +1182,10 @@ static struct page *hugetlbfs_pagecache_page(struct vm_area_struct *vma, > unsigned long address) > { > struct address_space *mapping; > - unsigned long idx; > + pgoff_t idx; > > mapping = vma->vm_file->f_mapping; > - idx = ((address - vma->vm_start) >> HPAGE_SHIFT) > - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); > + idx = vma_pagecache_offset(vma, address); > > return find_lock_page(mapping, idx); > } > @@ -1163,7 +1194,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pte_t *ptep, int write_access) > { > int ret = VM_FAULT_SIGBUS; > - unsigned long idx; > + pgoff_t idx; > unsigned long size; > struct page *page; > struct address_space *mapping; > @@ -1182,8 +1213,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, > } > > mapping = vma->vm_file->f_mapping; > - idx = ((address - vma->vm_start) >> HPAGE_SHIFT) > - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); > + idx = vma_pagecache_offset(vma, address); > > /* > * Use page lock to guard against racing truncation > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/2] huge page MAP_NORESERVE review cleanups 2008-05-30 16:57 ` [PATCH 0/2] hugetlb reservations v4/MAP_NORESERVE V3 cleanups Andy Whitcroft 2008-05-30 16:58 ` [PATCH 1/2] huge page private reservation review cleanups Andy Whitcroft @ 2008-05-30 16:58 ` Andy Whitcroft 1 sibling, 0 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-05-30 16:58 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, agl, wli, kenchen, dwg, andi, Mel Gorman, dean, abh, Andy Whitcroft Use the new encapsulated huge page offset helper. Signed-off-by: Andy Whitcroft <apw@shadowen.org> --- mm/hugetlb.c | 6 ++---- 1 files changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1dce03a..901e580 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -733,8 +733,7 @@ static int vma_needs_reservation(struct vm_area_struct *vma, unsigned long addr) struct inode *inode = mapping->host; if (vma->vm_flags & VM_SHARED) { - unsigned long idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) + - (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + pgoff_t idx = vma_pagecache_offset(vma, addr); return region_chg(&inode->i_mapping->private_list, idx, idx + 1); @@ -752,8 +751,7 @@ static void vma_commit_reservation(struct vm_area_struct *vma, struct inode *inode = mapping->host; if (vma->vm_flags & VM_SHARED) { - unsigned long idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) + - (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + pgoff_t idx = vma_pagecache_offset(vma, addr); region_add(&inode->i_mapping->private_list, idx, idx + 1); } } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 16+ messages in thread
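The two hunks above follow a check-then-commit pattern: vma_needs_reservation() asks, via region_chg(), whether a fault at this index would require a new reservation against the shared mapping, and vma_commit_reservation() records it via region_add() once the fault has been satisfied. The toy sketch below models only that single-page idempotence; the array-backed map, the toy_ names and main() are assumptions standing in for the real region list kept on inode->i_mapping->private_list.

#include <stdio.h>

#define NR_IDX 64

/*
 * Toy stand-in for the shared-mapping reservation map, reduced to the
 * single-page [idx, idx + 1) case the patch deals with.
 */
static int reserved[NR_IDX];

/* How many pages must newly be reserved to cover idx (0 if already covered)? */
static long toy_region_chg(unsigned long idx)
{
	return reserved[idx] ? 0 : 1;
}

/* Commit the reservation for idx. */
static void toy_region_add(unsigned long idx)
{
	reserved[idx] = 1;
}

int main(void)
{
	unsigned long idx = 4;	/* e.g. the pagecache index of the faulting page */

	if (toy_region_chg(idx) > 0)	/* vma_needs_reservation() step */
		toy_region_add(idx);	/* vma_commit_reservation() step */

	printf("a second fault on idx %lu needs %ld new reservations\n",
	       idx, toy_region_chg(idx));
	return 0;
}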
Thread overview: 16+ messages
2008-05-27 18:50 [PATCH 0/3] Guarantee faults for processes that call mmap(MAP_PRIVATE) on hugetlbfs v4 Mel Gorman
2008-05-27 18:50 ` [PATCH 1/3] Move hugetlb_acct_memory() Mel Gorman
2008-05-28 13:37 ` Adam Litke
2008-05-27 18:51 ` [PATCH 2/3] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() Mel Gorman
2008-05-28 13:52 ` Adam Litke
2008-05-27 18:51 ` [PATCH 3/3] Guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed Mel Gorman
2008-05-28 16:00 ` Mel Gorman
2008-05-28 18:15 ` Adam Litke
2008-05-28 18:16 ` Adam Litke
2008-05-29 1:42 ` Andrew Morton
2008-05-30 16:57 ` [PATCH 0/2] hugetlb reservations v4/MAP_NORESERVE V3 cleanups Andy Whitcroft
2008-05-30 16:58 ` [PATCH 1/2] huge page private reservation review cleanups Andy Whitcroft
2008-05-30 20:29 ` Andrew Morton
2008-05-31 13:06 ` Mel Gorman
2008-05-31 12:21 ` Mel Gorman
2008-05-30 16:58 ` [PATCH 2/2] huge page MAP_NORESERVE " Andy Whitcroft