* [PATCH 0/3] Demand faulting for huge pages
@ 2005-09-28 20:25 Adam Litke
2005-09-28 20:31 ` [PATCH 1/3 htlb-get_user_pages] " Adam Litke
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Adam Litke @ 2005-09-28 20:25 UTC (permalink / raw)
To: akpm; +Cc: ADAM G. LITKE [imap], linux-kernel, linux-mm
Hi Andrew. Can we give hugetlb demand faulting a spin in the mm tree?
And could people with alpha, sparc, and ia64 machines give them a good
spin? I haven't been able to test those arches yet.
-Thanks
- htlb-get_user_pages removes an optimization that is no longer valid
when demand faulting huge pages
- htlb-fault moves the fault logic from hugetlb_prefault() to
hugetlb_pte_fault() and find_get_huge_page().
- htlb-acct adds an overcommit check to maintain the no-overcommit
semantics provided by hugetlb_prefault().
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [PATCH 1/3 htlb-get_user_pages] Demand faulting for huge pages
From: Adam Litke @ 2005-09-28 20:31 UTC (permalink / raw)
To: akpm; +Cc: ADAM G. LITKE [imap], linux-kernel, linux-mm

Initial Post (Thu, 18 Aug 2005)

In preparation for hugetlb demand faulting, remove this get_user_pages()
optimization.  Since huge pages will no longer be prefaulted, we can't assume
that the huge ptes are established and hence, calling follow_hugetlb_page()
is not valid.

With the follow_hugetlb_page() call removed, the normal code path will be
triggered.  follow_page() will either use follow_huge_addr() or
follow_huge_pmd() to check for a previously faulted "page" to return.  When
this fails (ie. with demand faults), __handle_mm_fault() gets called, which
invokes the hugetlb_fault() handler to instantiate the huge page.

This patch doesn't make a lot of sense by itself, but I've broken it out to
facilitate discussion on this specific element of the demand fault changes.
While coding this up, I referenced previous discussion on this topic starting
at http://lkml.org/lkml/2004/4/13/176 , which contains more opinions about
the correctness of this approach.

Diffed against 2.6.14-rc2-git6

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 memory.c |    5 -----
 1 files changed, 5 deletions(-)

diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c
+++ current/mm/memory.c
@@ -949,11 +949,6 @@ int get_user_pages(struct task_struct *t
 				|| !(flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
-		if (is_vm_hugetlb_page(vma)) {
-			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i);
-			continue;
-		}
 		spin_lock(&mm->page_table_lock);
 		do {
 			int write_access = write;
-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
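The fallback path described above can be modelled in a few lines of plain
user-space C.  This is only a sketch: the "page table" is an array and
follow_page()/handle_mm_fault() are mock stand-ins, not the kernel
functions.

```c
#include <assert.h>

#define NPAGES 4

/* toy "page table": 0 means the huge pte is not instantiated yet */
int page_table[NPAGES];

/* stand-in for follow_page(): succeeds only for an established pte */
static int follow_page_sim(int idx)
{
    return page_table[idx];
}

/* stand-in for hugetlb_fault(): instantiates the page on demand */
static void handle_mm_fault_sim(int idx)
{
    page_table[idx] = 1000 + idx;
}

/*
 * Sketch of the generic get_user_pages() path that huge pages now
 * share: try the lookup first, and on failure fault the page in and
 * retry the lookup.
 */
static int get_user_page_sim(int idx)
{
    int page = follow_page_sim(idx);

    if (!page) {
        handle_mm_fault_sim(idx);       /* demand fault */
        page = follow_page_sim(idx);    /* lookup succeeds now */
    }
    return page;
}
```

With prefaulting, the first lookup always succeeded for hugetlb vmas; with
demand faulting, the fault branch does the instantiation the removed
follow_hugetlb_page() call used to rely on.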
* Re: [PATCH 2/3 htlb-fault] Demand faulting for huge pages
From: Adam Litke @ 2005-09-28 20:32 UTC (permalink / raw)
To: akpm; +Cc: ADAM G. LITKE [imap], linux-kernel, linux-mm

Version 3 (Thu, 08 Sep 2005)
 - Organized logic in hugetlb_pte_fault() by breaking out the
   find_get_page/alloc_huge_page logic into a separate function
 - Removed a few more paranoid checks
 - Fixed tlb flushing in a race case (thanks Yanmin Zhang)

Version 2 (Wed, 17 Aug 2005)
 - Removed spurious WARN_ON()
 - Patches added earlier in the series:
   - Check for p?d_none() in arch/i386/mm/hugetlbpage.c:huge_pte_offset()
   - Move i386 stale pte check into huge_pte_alloc()

Initial Post (Fri, 05 Aug 2005)

Below is a patch to implement demand faulting for huge pages.  The main
motivation for changing from prefaulting to demand faulting is so that huge
page memory areas can be allocated according to NUMA policy.

Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler.  The bulk of the patch just moves the logic from
hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
Diffed against 2.6.14-rc2-git6 Signed-off-by: Adam Litke <agl@us.ibm.com> --- fs/hugetlbfs/inode.c | 6 - include/linux/hugetlb.h | 2 mm/hugetlb.c | 154 +++++++++++++++++++++++++++++------------------- mm/memory.c | 2 4 files changed, 98 insertions(+), 66 deletions(-) diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c --- reference/fs/hugetlbfs/inode.c +++ current/fs/hugetlbfs/inode.c @@ -48,7 +48,6 @@ int sysctl_hugetlb_shm_group; static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) { struct inode *inode = file->f_dentry->d_inode; - struct address_space *mapping = inode->i_mapping; loff_t len, vma_len; int ret; @@ -79,10 +78,7 @@ static int hugetlbfs_file_mmap(struct fi if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size) goto out; - ret = hugetlb_prefault(mapping, vma); - if (ret) - goto out; - + ret = 0; if (inode->i_size < len) inode->i_size = len; out: diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h --- reference/include/linux/hugetlb.h +++ current/include/linux/hugetlb.h @@ -25,6 +25,8 @@ int is_hugepage_mem_enough(size_t); unsigned long hugetlb_total_pages(void); struct page *alloc_huge_page(void); void free_huge_page(struct page *); +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct * vma, + unsigned long address, int write_access); extern unsigned long max_huge_pages; extern const unsigned long hugetlb_zero, hugetlb_infinity; diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c --- reference/mm/hugetlb.c +++ current/mm/hugetlb.c @@ -274,21 +274,22 @@ int copy_hugetlb_page_range(struct mm_st { pte_t *src_pte, *dst_pte, entry; struct page *ptepage; - unsigned long addr = vma->vm_start; + unsigned long addr; unsigned long end = vma->vm_end; - while (addr < end) { + for (addr = vma->vm_start; addr < end; addr += HPAGE_SIZE) { + src_pte = huge_pte_offset(src, addr); + if (!src_pte || pte_none(*src_pte)) + continue; + dst_pte = huge_pte_alloc(dst, addr); if (!dst_pte) goto 
nomem; - src_pte = huge_pte_offset(src, addr); - BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */ entry = *src_pte; ptepage = pte_page(entry); get_page(ptepage); add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE); set_huge_pte_at(dst, addr, dst_pte, entry); - addr += HPAGE_SIZE; } return 0; @@ -338,61 +339,6 @@ void zap_hugepage_range(struct vm_area_s spin_unlock(&mm->page_table_lock); } -int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma) -{ - struct mm_struct *mm = current->mm; - unsigned long addr; - int ret = 0; - - WARN_ON(!is_vm_hugetlb_page(vma)); - BUG_ON(vma->vm_start & ~HPAGE_MASK); - BUG_ON(vma->vm_end & ~HPAGE_MASK); - - hugetlb_prefault_arch_hook(mm); - - spin_lock(&mm->page_table_lock); - for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) { - unsigned long idx; - pte_t *pte = huge_pte_alloc(mm, addr); - struct page *page; - - if (!pte) { - ret = -ENOMEM; - goto out; - } - - idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); - page = find_get_page(mapping, idx); - if (!page) { - /* charge the fs quota first */ - if (hugetlb_get_quota(mapping)) { - ret = -ENOMEM; - goto out; - } - page = alloc_huge_page(); - if (!page) { - hugetlb_put_quota(mapping); - ret = -ENOMEM; - goto out; - } - ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC); - if (! 
ret) { - unlock_page(page); - } else { - hugetlb_put_quota(mapping); - free_huge_page(page); - goto out; - } - } - add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); - set_huge_pte_at(mm, addr, pte, make_huge_pte(vma, page)); - } -out: - spin_unlock(&mm->page_table_lock); - return ret; -} - int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *position, int *length, int i) @@ -440,3 +386,91 @@ int follow_hugetlb_page(struct mm_struct return i; } + +static struct page *find_get_huge_page(struct address_space *mapping, + unsigned long idx) +{ + struct page *page = NULL; + +retry: + page = find_get_page(mapping, idx); + if (page) + goto out; + + if (hugetlb_get_quota(mapping)) + goto out; + page = alloc_huge_page(); + if (!page) { + hugetlb_put_quota(mapping); + goto out; + } + + if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) { + put_page(page); + hugetlb_put_quota(mapping); + goto retry; + } + unlock_page(page); +out: + return page; +} + +static int hugetlb_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int write_access) +{ + int ret = VM_FAULT_MINOR; + unsigned long idx; + pte_t *pte; + struct page *page; + struct address_space *mapping; + + BUG_ON(vma->vm_start & ~HPAGE_MASK); + BUG_ON(vma->vm_end & ~HPAGE_MASK); + BUG_ON(!vma->vm_file); + + pte = huge_pte_offset(mm, address); + if (!pte) { + ret = VM_FAULT_SIGBUS; + goto out; + } + if (!pte_none(*pte)) + goto out; + + mapping = vma->vm_file->f_mapping; + idx = ((address - vma->vm_start) >> HPAGE_SHIFT) + + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT)); + + page = find_get_huge_page(mapping, idx); + if (!page) { + ret = VM_FAULT_SIGBUS; + goto out; + } + + add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE); + set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page)); +out: + return ret; +} + +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, int 
write_access) +{ + pte_t *ptep; + int rc = VM_FAULT_MINOR; + + spin_lock(&mm->page_table_lock); + + ptep = huge_pte_alloc(mm, address); + if (!ptep) { + rc = VM_FAULT_SIGBUS; + goto out; + } + if (pte_none(*ptep)) + rc = hugetlb_pte_fault(mm, vma, address, write_access); +out: + if (rc == VM_FAULT_MINOR) + flush_tlb_page(vma, address); + + spin_unlock(&mm->page_table_lock); + return rc; +} diff -upN reference/mm/memory.c current/mm/memory.c --- reference/mm/memory.c +++ current/mm/memory.c @@ -2041,7 +2041,7 @@ int __handle_mm_fault(struct mm_struct * inc_page_state(pgfault); if (is_vm_hugetlb_page(vma)) - return VM_FAULT_SIGBUS; /* mapping truncation does this. */ + return hugetlb_fault(mm, vma, address, write_access); /* * We need the page table lock to synchronize with kswapd -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3 htlb-fault] Demand faulting for huge pages
From: Andrew Morton @ 2005-09-29 6:09 UTC (permalink / raw)
To: Adam Litke; +Cc: agl, linux-kernel, linux-mm

Adam Litke <agl@us.ibm.com> wrote:
>
> +static struct page *find_get_huge_page(struct address_space *mapping,
> +			unsigned long idx)
> +{
> +	struct page *page = NULL;
> +
> +retry:
> +	page = find_get_page(mapping, idx);
> +	if (page)
> +		goto out;
> +
> +	if (hugetlb_get_quota(mapping))
> +		goto out;
> +	page = alloc_huge_page();
> +	if (!page) {
> +		hugetlb_put_quota(mapping);
> +		goto out;
> +	}
> +
> +	if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {
> +		put_page(page);
> +		hugetlb_put_quota(mapping);
> +		goto retry;

If add_to_page_cache() fails due to failure in radix_tree_preload(), this
code will lock up.

A lame fix is to check for -ENOMEM and bale.  A better fix would be to use
GFP_KERNEL.
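Andrew's "check for -ENOMEM and bale" fix can be sketched with a user-space
stand-in for add_to_page_cache().  Everything here is a mock (the real call
can fail with -EEXIST when another thread wins the insertion race, or with
-ENOMEM when radix_tree_preload() fails under GFP_ATOMIC); the point is only
to contrast a terminating retry loop with one that can spin forever.

```c
#include <assert.h>
#include <errno.h>

/* what the mock add_to_page_cache() will return; set by the caller */
int cache_insert_result;

static int add_to_page_cache_mock(void)
{
    return cache_insert_result;
}

/*
 * Sketch of the fixed lookup loop: -EEXIST (lost the insertion race)
 * retries, while -ENOMEM bails out instead of looping forever.
 */
static int find_get_huge_page_sketch(int max_tries)
{
    while (max_tries-- > 0) {
        int err = add_to_page_cache_mock();

        if (err == 0)
            return 0;           /* page inserted into the cache */
        if (err == -ENOMEM)
            return -ENOMEM;     /* allocation failure: give up */
        /* -EEXIST: someone else inserted first; retry the lookup */
    }
    return -EAGAIN;
}
```

The original patch retried on any nonzero return, which is why a persistent
-ENOMEM from GFP_ATOMIC turns into a livelock.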
* Re: [PATCH 2/3 htlb-fault] Demand faulting for huge pages
From: Andrew Morton @ 2005-09-29 6:10 UTC (permalink / raw)
To: Adam Litke; +Cc: agl, linux-kernel, linux-mm

Adam Litke <agl@us.ibm.com> wrote:
>
> +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, int write_access)
> +{
> +	pte_t *ptep;
> +	int rc = VM_FAULT_MINOR;
> +
> +	spin_lock(&mm->page_table_lock);
> +
> +	ptep = huge_pte_alloc(mm, address);
> +	if (!ptep) {
> +		rc = VM_FAULT_SIGBUS;
> +		goto out;
> +	}
> +	if (pte_none(*ptep))
> +		rc = hugetlb_pte_fault(mm, vma, address, write_access);
> +out:
> +	if (rc == VM_FAULT_MINOR)
> +		flush_tlb_page(vma, address);
> +
> +	spin_unlock(&mm->page_table_lock);
> +	return rc;
> +}

label `out' can be moved down a couple of lines.
* Re: [PATCH 3/3 htlb-acct] Demand faulting for huge pages
From: Adam Litke @ 2005-09-28 20:33 UTC (permalink / raw)
To: akpm; +Cc: ADAM G. LITKE [imap], linux-kernel, linux-mm

Initial Post (Thu, 18 Aug 2005)

Basic overcommit checking for hugetlb_file_map() based on an implementation
used with demand faulting in SLES9.

Since demand faulting can't guarantee the availability of pages at mmap
time, this patch implements a basic sanity check to ensure that the number
of huge pages required to satisfy the mmap are currently available.  Despite
the obvious race, I think it is a good start on doing proper accounting.
I'd like to work towards an accounting system that mimics the semantics of
normal pages (especially for the MAP_PRIVATE/COW case).  That work is
underway and builds on what this patch starts.

Huge page shared memory segments are simpler and still maintain their
commit-on-shmget semantics.
Diffed against 2.6.14-rc2-git6 Signed-off-by: Adam Litke <agl@us.ibm.com> --- inode.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 47 insertions(+) diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c --- reference/fs/hugetlbfs/inode.c +++ current/fs/hugetlbfs/inode.c @@ -45,9 +45,51 @@ static struct backing_dev_info hugetlbfs int sysctl_hugetlb_shm_group; +static void huge_pagevec_release(struct pagevec *pvec); + +unsigned long +huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma) +{ + int i; + struct pagevec pvec; + unsigned long start = vma->vm_start; + unsigned long end = vma->vm_end; + unsigned long hugepages = (end - start) >> HPAGE_SHIFT; + pgoff_t next = vma->vm_pgoff; + pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT); + struct inode *inode = vma->vm_file->f_dentry->d_inode; + + /* + * Shared memory segments are accounted for at shget time, + * not at shmat (when the mapping is actually created) so + * check here if the memory has already been accounted for. 
+ */ + if (inode->i_blocks != 0) + return 0; + + pagevec_init(&pvec, 0); + while (next < endpg) { + if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) + break; + for (i = 0; i < pagevec_count(&pvec); i++) { + struct page *page = pvec.pages[i]; + if (page->index > next) + next = page->index; + if (page->index >= endpg) + break; + next++; + hugepages--; + } + huge_pagevec_release(&pvec); + } + return hugepages << HPAGE_SHIFT; +} + static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) { struct inode *inode = file->f_dentry->d_inode; + struct address_space *mapping = inode->i_mapping; + unsigned long bytes; loff_t len, vma_len; int ret; @@ -66,6 +108,10 @@ static int hugetlbfs_file_mmap(struct fi if (vma->vm_end - vma->vm_start < HPAGE_SIZE) return -EINVAL; + bytes = huge_pages_needed(mapping, vma); + if (!is_hugepage_mem_enough(bytes)) + return -ENOMEM; + vma_len = (loff_t)(vma->vm_end - vma->vm_start); down(&inode->i_sem); @@ -794,6 +840,7 @@ struct file *hugetlb_zero_setup(size_t s d_instantiate(dentry, inode); inode->i_size = size; inode->i_nlink = 0; + inode->i_blocks = 1; file->f_vfsmnt = mntget(hugetlbfs_vfsmount); file->f_dentry = dentry; file->f_mapping = inode->i_mapping; -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 htlb-acct] Demand faulting for huge pages
From: Andrew Morton @ 2005-09-29 6:20 UTC (permalink / raw)
To: Adam Litke; +Cc: agl, linux-kernel, linux-mm

Adam Litke <agl@us.ibm.com> wrote:
>
> Initial Post (Thu, 18 Aug 2005)
>
> Basic overcommit checking for hugetlb_file_map() based on an implementation
> used with demand faulting in SLES9.
>
> Since demand faulting can't guarantee the availability of pages at mmap time,
> this patch implements a basic sanity check to ensure that the number of huge
> pages required to satisfy the mmap are currently available.  Despite the
> obvious race, I think it is a good start on doing proper accounting.  I'd like
> to work towards an accounting system that mimics the semantics of normal pages
> (especially for the MAP_PRIVATE/COW case).  That work is underway and builds on
> what this patch starts.
>
> Huge page shared memory segments are simpler and still maintain their commit on
> shmget semantics.
>
> Diffed against 2.6.14-rc2-git6
>
> Signed-off-by: Adam Litke <agl@us.ibm.com>
> ---
>  inode.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 47 insertions(+)
> diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
> --- reference/fs/hugetlbfs/inode.c
> +++ current/fs/hugetlbfs/inode.c
> @@ -45,9 +45,51 @@ static struct backing_dev_info hugetlbfs
>
>  int sysctl_hugetlb_shm_group;
>
> +static void huge_pagevec_release(struct pagevec *pvec);

nit: personally I prefer to move the helper function to the top of the file
rather than having to forward-declare it.

> +unsigned long
> +huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
> +{

What does this function do?  Seems to count all the present pages within a
vma which are backed by a particular hugetlbfs file?  Or something?

If so, the chosen name seems strange.  And it definitely needs a decent
comment.

> +	int i;
> +	struct pagevec pvec;
> +	unsigned long start = vma->vm_start;
> +	unsigned long end = vma->vm_end;
> +	unsigned long hugepages = (end - start) >> HPAGE_SHIFT;

`hugepages' is the size of the vma

> +	pgoff_t next = vma->vm_pgoff;
> +	pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT);
> +	struct inode *inode = vma->vm_file->f_dentry->d_inode;
> +
> +	/*
> +	 * Shared memory segments are accounted for at shget time,
> +	 * not at shmat (when the mapping is actually created) so
> +	 * check here if the memory has already been accounted for.
> +	 */
> +	if (inode->i_blocks != 0)
> +		return 0;
> +
> +	pagevec_init(&pvec, 0);
> +	while (next < endpg) {
> +		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
> +			break;
> +		for (i = 0; i < pagevec_count(&pvec); i++) {
> +			struct page *page = pvec.pages[i];
> +			if (page->index > next)
> +				next = page->index;
> +			if (page->index >= endpg)
> +				break;
> +			next++;
> +			hugepages--;

And we subtract one from it for each present page.

> +		}
> +		huge_pagevec_release(&pvec);
> +	}
> +	return hugepages << HPAGE_SHIFT;
> +}

So it seems to be returning the number of bytes which are still unpopulated
within this vma?

Think you can rework this code to reduce my perplexity?
* Re: [PATCH 3/3 htlb-acct] Demand faulting for huge pages
From: Andi Kleen @ 2005-09-29 9:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: agl, linux-kernel, linux-mm

Andrew Morton <akpm@osdl.org> writes:

(having written the original SLES9 code I will chime in ...)

> > +unsigned long
> > +huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
> > +{
>
> What does this function do?  Seems to count all the present pages within a
> vma which are backed by a particular hugetlbfs file?  Or something?

It counts how many huge pages are still needed to fill up a mapping
completely.  In short it counts the holes.  I think the name fits.

-Andi
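Andi's description — count the holes left in the mapping — reduces to
simple arithmetic.  A user-space sketch, where the `present` array stands
in for the indices pagevec_lookup() would return from the page cache (this
is an illustration, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* sample page-cache state: huge page indices 0 and 2 are populated */
const unsigned long sample_present[2] = { 0, 2 };

/*
 * Toy model of huge_pages_needed(): start from the number of huge
 * pages the mapping spans, then subtract one for every page already
 * present in the cache for that index range.  What remains are the
 * holes that a demand fault would still have to allocate.
 */
static unsigned long count_holes(unsigned long total_hugepages,
                                 unsigned long start_idx,
                                 unsigned long end_idx,
                                 const unsigned long *present, size_t n)
{
    unsigned long holes = total_hugepages;
    size_t i;

    for (i = 0; i < n; i++)
        if (present[i] >= start_idx && present[i] < end_idx)
            holes--;            /* this huge page needs no new memory */
    return holes;
}
```

For a 4-huge-page mapping with pages 0 and 2 present, two holes remain;
multiplying by the huge page size gives the bytes the overcommit check must
find available.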
* [NUMA, x86_64] Why memnode_shift is chosen with the lowest possible value?
From: Eric Dumazet @ 2005-09-29 13:40 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel

Hi Andi

I have a dual Opteron machine, with 8GB of ram on each node.  With latest
kernels I have high CPU profiles in mm/slab.c.  kfree() is NUMA aware, so
far so good, but the price seems heavy.

I noticed in 2.6.14-rc2 syslog:

Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 23 for the hash shift. Max adder is 3ffffffff

instead of previous (2.6.13):

Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 27 for the hash shift. Max adder is 3ffffffff

After some code review, I see NODEMAPSIZE raised from 0xff to 0xfff.
phys_to_nid() is now reading one byte out of 2048 bytes with
(memnode_shift=23, units of 8MB).  But shouldn't we try to use the highest
possible value for memnode_shift?  Using memnode_shift=33 would access only
2 bytes from this memnodemap[], touching fewer cache lines (well, one cache
line).  kfree() and friends would be slightly faster, at least cache
friendly.

Another question is: could we add in the pda (struct x8664_pda) the node of
the cpu?  We currently do:

#define numa_node_id() (cpu_to_node(raw_smp_processor_id()))

Instead of reading the processor_id from the pda, then accessing
cpu_to_node[], we could directly get this information from the pda:

#if defined(CONFIG_NUMA)
static inline __attribute_pure__ int numa_node_id() { return read_pda(node); }
#else
#define numa_node_id() 0
#endif

Thank you
Eric
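The footprint numbers quoted above follow directly from the shift: each
memnodemap[] byte covers 1 << shift bytes of physical address space, so for
a top physical address of 0x3ffffffff (16GB) the number of live entries is
just the covered address space divided by that granule.  A sketch of the
arithmetic (not kernel code):

```c
#include <assert.h>

/*
 * Number of memnodemap[] entries (one byte each) needed to cover
 * physical addresses 0..max_addr for a given memnode_shift.
 */
static unsigned long map_entries_used(unsigned long long max_addr, int shift)
{
    return (unsigned long)((max_addr >> shift) + 1);
}
```

With shift 23 (8MB granules) the 16GB machine touches 2048 map bytes — 32
cache lines of 64 bytes — while shift 33 (8GB granules) needs only 2 bytes,
which is why phys_to_nid() becomes effectively a single cache-line access.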
* Re: [NUMA, x86_64] Why memnode_shift is chosen with the lowest possible value?
From: Andi Kleen @ 2005-09-29 13:43 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel

> Using memnode_shift=33 would access only 2 bytes from this memnodemap[],
> touching fewer cache lines (well, one cache line).  kfree() and friends
> would be slightly faster, at least cache friendly.

Agreed.  Please send a patch.

> Another question is:
>
> Could we add in pda (struct x8664_pda) the node of the cpu?
>
> We currently do:
>
> #define numa_node_id() (cpu_to_node(raw_smp_processor_id()))
>
> Instead of reading the processor_id from pda, then access cpu_to_node[], we
> could directly get this information from pda.
>
> #if defined(CONFIG_NUMA)
> static inline __attribute_pure__ int numa_node_id() { return read_pda(node); }
> #else
> #define numa_node_id() 0
> #endif

Should be fine too.  Please send a patch for that too.

-Andi
* Re: [NUMA , x86_64] Why memnode_shift is chosen with the lowest possible value ? 2005-09-29 13:43 ` Andi Kleen @ 2005-09-29 16:59 ` Eric Dumazet 2005-09-30 9:09 ` Eric Dumazet 0 siblings, 1 reply; 16+ messages in thread From: Eric Dumazet @ 2005-09-29 16:59 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 964 bytes --] Andi Kleen a écrit : >>Using memnode_shift=33 would access only 2 bytes from this memnodemap[], >>touching fewer cache lines (well , one cache line). kfree() and friends >>would be slightly faster, at least cache friendly. > > > Agreed. Please send a patch. OK let's try :) [PATCH] NUMA,x86_64 : Compute the highest possible value for memnode_shift, in order to reduce footprint of memnodemap[] to the minimum, thus making all users (phys_to_nid(), kfree()), more cache friendly. Before the patch : Node 0 MemBase 0000000000000000 Limit 00000001ffffffff Node 1 MemBase 0000000200000000 Limit 00000003ffffffff Using 23 for the hash shift. Max adder is 3ffffffff After the patch : Node 0 MemBase 0000000000000000 Limit 00000001ffffffff Node 1 MemBase 0000000200000000 Limit 00000003ffffffff Using 33 for the hash shift. 
In this case, only 2 bytes of memnodemap[] are used, instead of 2048 Thank you Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: compute_hash_shift --] [-- Type: text/plain, Size: 2402 bytes --] --- linux-2.6.14-rc2/arch/x86_64/mm/numa.c 2005-09-20 05:00:41.000000000 +0200 +++ linux-2.6.14-rc2-ed/arch/x86_64/mm/numa.c 2005-09-29 18:45:42.000000000 +0200 @@ -38,38 +38,62 @@ int numa_off __initdata; -int __init compute_hash_shift(struct node *nodes, int numnodes) + +/* + * Given a shift value, try to populate memnodemap[] + * Returns : + * 1 if OK + * 0 if memnodmap[] too small (of shift too small) + * -1 if node overlap or lost ram (shift too big) + */ +static int __init populate_memnodemap( + const struct node *nodes, int numnodes, int shift) { int i; - int shift = 20; - unsigned long addr,maxend=0; - - for (i = 0; i < numnodes; i++) - if ((nodes[i].start != nodes[i].end) && (nodes[i].end > maxend)) - maxend = nodes[i].end; - - while ((1UL << shift) < (maxend / NODEMAPSIZE)) - shift++; + int res = -1; + unsigned long addr, end, lost ; - printk (KERN_DEBUG"Using %d for the hash shift. 
Max adder is %lx \n", - shift,maxend); - memset(memnodemap,0xff,sizeof(*memnodemap) * NODEMAPSIZE); + memset(memnodemap, 0xff, sizeof(memnodemap)); for (i = 0; i < numnodes; i++) { - if (nodes[i].start == nodes[i].end) + addr = (nodes[i].start >> shift); + end = ((nodes[i].end + 1) >> shift); + + /* check we dont loose more than 16 MB of ram */ + lost = (nodes[i].end + 1) - (end << shift); + if (lost >= (1<<24)) + return -1; + + if (addr >= end) continue; - for (addr = nodes[i].start; - addr < nodes[i].end; - addr += (1UL << shift)) { - if (memnodemap[addr >> shift] != 0xff) { - printk(KERN_INFO - "Your memory is not aligned you need to rebuild your kernel " - "with a bigger NODEMAPSIZE shift=%d adder=%lu\n", - shift,addr); + if (end >= NODEMAPSIZE) + return 0; + res = 1; + for (; addr < end; addr++) { + if (memnodemap[addr] != 0xff) return -1; - } - memnodemap[addr >> shift] = i; - } + memnodemap[addr] = i; + } } + return res; +} + +int __init compute_hash_shift(struct node *nodes, int numnodes) +{ + int shift = 20; + + while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0) + shift++; + + printk(KERN_DEBUG "Using %d for the hash shift.\n", + shift); + + if (populate_memnodemap(nodes, numnodes, shift) != 1) { + printk(KERN_INFO + "Your memory is not aligned you need to rebuild your kernel " + "with a bigger NODEMAPSIZE shift=%d\n", + shift); + return -1; + } return shift; } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [NUMA, x86_64] Why memnode_shift is chosen with the lowest possible value?
From: Eric Dumazet @ 2005-09-30 9:09 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel

Eric Dumazet wrote:
> Andi Kleen wrote:
>
>>> Using memnode_shift=33 would access only 2 bytes from this
>>> memnodemap[], touching fewer cache lines (well, one cache line).
>>> kfree() and friends would be slightly faster, at least cache friendly.
>>
>> Agreed.  Please send a patch.

My previous patch was not correct.  Could you please review this second
version?

[PATCH] NUMA, x86_64: Compute the highest possible value for memnode_shift,
in order to reduce the footprint of memnodemap[] to the minimum, thus making
all users (phys_to_nid(), kfree()) more cache friendly.

Before the patch:
Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 23 for the hash shift. Max adder is 3ffffffff

After the patch:
Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 33 for the hash shift.
In this case, only 2 bytes of memnodemap[] are used, instead of 2048 Thank you Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: compute_hash_shift --] [-- Type: text/plain, Size: 2241 bytes --] --- linux-2.6.14-rc2/arch/x86_64/mm/numa.c 2005-09-20 05:00:41.000000000 +0200 +++ linux-2.6.14-rc2-ed/arch/x86_64/mm/numa.c 2005-09-30 11:04:18.000000000 +0200 @@ -38,38 +38,57 @@ int numa_off __initdata; -int __init compute_hash_shift(struct node *nodes, int numnodes) + +/* + * Given a shift value, try to populate memnodemap[] + * Returns : + * 1 if OK + * 0 if memnodmap[] too small (of shift too small) + * -1 if node overlap or lost ram (shift too big) + */ +static int __init populate_memnodemap( + const struct node *nodes, int numnodes, int shift) { int i; - int shift = 20; - unsigned long addr,maxend=0; - - for (i = 0; i < numnodes; i++) - if ((nodes[i].start != nodes[i].end) && (nodes[i].end > maxend)) - maxend = nodes[i].end; + int res = -1; + unsigned long addr, end; - while ((1UL << shift) < (maxend / NODEMAPSIZE)) - shift++; - - printk (KERN_DEBUG"Using %d for the hash shift. 
Max adder is %lx \n", - shift,maxend); - memset(memnodemap,0xff,sizeof(*memnodemap) * NODEMAPSIZE); + memset(memnodemap, 0xff, sizeof(memnodemap)); for (i = 0; i < numnodes; i++) { - if (nodes[i].start == nodes[i].end) + addr = nodes[i].start; + end = nodes[i].end; + if (addr >= end) continue; - for (addr = nodes[i].start; - addr < nodes[i].end; - addr += (1UL << shift)) { - if (memnodemap[addr >> shift] != 0xff) { - printk(KERN_INFO - "Your memory is not aligned you need to rebuild your kernel " - "with a bigger NODEMAPSIZE shift=%d adder=%lu\n", - shift,addr); + if ((end >> shift) >= NODEMAPSIZE) + return 0; + do { + if (memnodemap[addr >> shift] != 0xff) return -1; - } memnodemap[addr >> shift] = i; - } + addr += (1 << shift); + } while (addr < end); + res = 1; } + return res; +} + +int __init compute_hash_shift(struct node *nodes, int numnodes) +{ + int shift = 20; + + while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0) + shift++; + + printk(KERN_DEBUG "Using %d for the hash shift.\n", + shift); + + if (populate_memnodemap(nodes, numnodes, shift) != 1) { + printk(KERN_INFO + "Your memory is not aligned you need to rebuild your kernel " + "with a bigger NODEMAPSIZE shift=%d\n", + shift); + return -1; + } return shift; } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [NUMA, x86_64] Why memnode_shift is chosen with the lowest possible value?
From: Andi Kleen @ 2005-10-04 17:13 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, discuss

On Friday 30 September 2005 11:09, Eric Dumazet wrote:

> +	while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0)
> +		shift++;

Why shift+1 here?

> +	if ((end >> shift) >= NODEMAPSIZE)
> +		return 0;

This should be >, not >=, shouldn't it?

-Andi

P.S.: Please cc x86-64 patches to discuss@x86-64.org
* Re: [NUMA , x86_64] Why memnode_shift is chosen with the lowest possible value ?
  2005-10-04 17:13 ` Andi Kleen
@ 2005-10-04 21:12 ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2005-10-04 21:12 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, discuss

Andi Kleen wrote:
> On Friday 30 September 2005 11:09, Eric Dumazet wrote:
>
>> +	while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0)
>> +		shift++;
>
> Why shift+1 here?

Thank you Andi for reviewing this stuff.

The idea is to find the highest shift value, and to break the loop as
soon as the (shift + 1) value gives us a "shift too big" error.

Maybe you want to write:

	while (populate_memnodemap(nodes, numnodes, ++shift) >= 0)
		;
	shift--;

Well, that's only style...

>> +	if ((end >> shift) >= NODEMAPSIZE)
>> +		return 0;
>
> This should be >, not >= shouldn't it?

Let's take an example:

	end = 0xffffffff; start = 0xfff00000; shift = 20

Suppose that NODEMAPSIZE == (end >> shift) == 0xfff

If the test is changed to:

	if ((end >> shift) > NODEMAPSIZE)
		return 0;

We could do one of the iterations with (addr < end) but
(addr >> shift) == NODEMAPSIZE:

	if (memnodemap[NODEMAPSIZE] != 0xff)
		return -1;
	memnodemap[NODEMAPSIZE] = i;

That's a bounds violation of memnodemap[].

AFAIK, I wonder why NODEMAPSIZE is 0xfff and not 0x1000, because this
off-by-one makes half of memnodemap[] unused for power-of-two ram sizes.

> -Andi
>
> P.S.: Please cc x86-64 patches to discuss@x86-64.org

Ah, thank you.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH 0/3] Demand faulting for huge pages
  2005-09-28 20:25 [PATCH 0/3] Demand faulting for huge pages Adam Litke
                   ` (2 preceding siblings ...)
  2005-09-28 20:33 ` [PATCH 3/3 htlb-acct] " Adam Litke
@ 2005-09-29 13:32 ` Hugh Dickins
  2005-10-06 15:22   ` Adam Litke
  3 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2005-09-29 13:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Adam Litke, William Irwin, linux-kernel, linux-mm

On Wed, 28 Sep 2005, Adam Litke wrote:

> Hi Andrew.  Can we give hugetlb demand faulting a spin in the mm tree?
> And could people with alpha, sparc, and ia64 machines give them a good
> spin?  I haven't been able to test those arches yet.

It's going to be a little confusing if these go in while I'm moving
the page_table_lock inwards.  My patches don't make a big difference
to hugetlb (I've not attempted splitting the lock at all for hugetlb -
there would be per-arch implementation issues and very little point -
though more point if we do move to hugetlb faulting).  But I'm ill at
ease with changing the locking at one end while it's unclear whether
it's right at the other end.

Currently Adam's patches don't include my hugetlb changes already in
-mm; and I don't see any attention in his patches to the issue of
hugetlb file truncation, which I was fixing up in those.

The current hugetlb_prefault guards against this with i_sem held:
which _appears_ to be a lock ordering violation, but may not be,
since the official lock ordering is determined by the possibility
of fault within write, whereas hugetlb mmaps were never faulting.

Presumably on-demand hugetlb faulting would entail truncate_count
checking like do_no_page, and corresponding code in hugetlbfs.

I've no experience of hugetlb use.  Personally, I'd be very happy with
a decision to disallow truncation of hugetlb files (seems odd to allow
ftruncate when read and write are not allowed, and the size normally
determined automatically by mmap size); but I have to assume that it's
been allowed for good reason.

> - htlb-get_user_pages removes an optimization that is no longer valid
>   when demand faulting huge pages
>
> - htlb-fault moves the fault logic from hugetlb_prefault() to
>   hugetlb_pte_fault() and find_get_huge_page().
>
> - htlb-acct adds an overcommit check to maintain the no-overcommit
>   semantics provided by hugetlb_prefault()

Yes, I found that last one rather strange too.  Doing it at creation
time based on i_size (and updating if i_size is allowed to change) is
one possibility; doing it at fault time whenever newly allocated is
another possibility; but these pagevec lookups at mmap time seem odd.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH 0/3] Demand faulting for huge pages
  2005-09-29 13:32 ` [PATCH 0/3] Demand faulting for huge pages Hugh Dickins
@ 2005-10-06 15:22 ` Adam Litke
  0 siblings, 0 replies; 16+ messages in thread
From: Adam Litke @ 2005-10-06 15:22 UTC (permalink / raw)
To: Hugh Dickins
Cc: David Gibson, Andrew Morton, William Irwin, linux-kernel, linux-mm

On Thu, 2005-09-29 at 14:32 +0100, Hugh Dickins wrote:
> On Wed, 28 Sep 2005, Adam Litke wrote:
>
> > Hi Andrew.  Can we give hugetlb demand faulting a spin in the mm tree?
> > And could people with alpha, sparc, and ia64 machines give them a good
> > spin?  I haven't been able to test those arches yet.
>
> It's going to be a little confusing if these go in while I'm moving
> the page_table_lock inwards.  My patches don't make a big difference
> to hugetlb (I've not attempted splitting the lock at all for hugetlb -
> there would be per-arch implementation issues and very little point -
> though more point if we do move to hugetlb faulting).  But I'm ill at
> ease with changing the locking at one end while it's unclear whether
> it's right at the other end.
>
> Currently Adam's patches don't include my hugetlb changes already in
> -mm; and I don't see any attention in his patches to the issue of
> hugetlb file truncation, which I was fixing up in those.
>
> The current hugetlb_prefault guards against this with i_sem held:
> which _appears_ to be a lock ordering violation, but may not be,
> since the official lock ordering is determined by the possibility
> of fault within write, whereas hugetlb mmaps were never faulting.
>
> Presumably on-demand hugetlb faulting would entail truncate_count
> checking like do_no_page, and corresponding code in hugetlbfs.
>
> I've no experience of hugetlb use.  Personally, I'd be very happy with
> a decision to disallow truncation of hugetlb files (seems odd to allow
> ftruncate when read and write are not allowed, and the size normally
> determined automatically by mmap size); but I have to assume that it's
> been allowed for good reason.

If I were to spend time coding up a patch to remove truncation support
for hugetlbfs, would it be something other people would want to see
merged as well?

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads: [~2005-10-06 15:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-09-28 20:25 [PATCH 0/3] Demand faulting for huge pages Adam Litke
2005-09-28 20:31 ` [PATCH 1/3 htlb-get_user_pages] " Adam Litke
2005-09-28 20:32 ` [PATCH 2/3 htlb-fault] " Adam Litke
2005-09-29  6:09 ` Andrew Morton
2005-09-29  6:10 ` Andrew Morton
2005-09-28 20:33 ` [PATCH 3/3 htlb-acct] " Adam Litke
2005-09-29  6:20 ` Andrew Morton
2005-09-29  9:45 ` Andi Kleen
2005-09-29 13:40 ` [NUMA , x86_64] Why memnode_shift is chosen with the lowest possible value ? Eric Dumazet
2005-09-29 13:43 ` Andi Kleen
2005-09-29 16:59 ` Eric Dumazet
2005-09-30  9:09 ` Eric Dumazet
2005-10-04 17:13 ` Andi Kleen
2005-10-04 21:12 ` Eric Dumazet
2005-09-29 13:32 ` [PATCH 0/3] Demand faulting for huge pages Hugh Dickins
2005-10-06 15:22 ` Adam Litke