From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx111.postini.com [74.125.245.111]) by kanga.kvack.org (Postfix) with SMTP id CB3F16B0005 for ; Mon, 4 Feb 2013 02:17:16 -0500 (EST) Received: by mail-da0-f54.google.com with SMTP id n2so2533248dad.27 for ; Sun, 03 Feb 2013 23:17:16 -0800 (PST) From: Michel Lespinasse Subject: [PATCH v2 0/3] fixes for large mm_populate() and munlock() operations Date: Sun, 3 Feb 2013 23:17:09 -0800 Message-Id: <1359962232-20811-1-git-send-email-walken@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org These 3 changes are to improve the handling of large mm_populate and munlock operations. They apply on top of mmotm (in particular, they depend on both my prior mm_populate work and Kirill's "thp: avoid dumping huge zero page" change). - Patch 1 fixes an integer overflow issue when populating 2^32 pages. The nr_pages argument to get_user_pages would overflow, resulting in 0 pages being processed per iteration. I am proposing to simply convert the nr_pages argument to a long. - Patch 2 accelerates populating regions with THP pages. get_user_pages() can increment the address by a huge page size in this case instead of a small page size, and avoid repeated mm->page_table_lock acquisitions. This fixes an issue reported by Roman Dubtsov where populating regions via mmap MAP_POPULATE was significantly slower than doing so by touching pages from userspace. - Patch 3 is a similar acceleration for the munlock case. I would actually like to get Andrea's attention on this one, as I can't explain how munlock_vma_page() is safe against racing with split_huge_page(). Note that patches 1-2 are logically independent of patch 3, so if the discussion of patch 3 takes too long I would ask Andrew to consider merging patches 1-2 first. Changes since v1: - Andrew accepted patch 1 into his -mm tree but suggested the nr_pages argument type should actually be unsigned long; I am sending this as a "fix" for the previous patch 1 to be collapsed over the previous one. - In patch 2, I am adding a separate follow_page_mask() function so that the callers to the original follow_page() don't have to be modified to ignore the returned page_mask (following another suggestion from Andrew). Also the page_mask argument type was changed to unsigned int. - In patch 3, I similarly changed the page_mask values to unsigned int. Michel Lespinasse (3): fix mm: use long type for page counts in mm_populate() and get_user_pages() mm: accelerate mm_populate() treatment of THP pages mm: accelerate munlock() treatment of THP pages include/linux/hugetlb.h | 2 +- include/linux/mm.h | 24 +++++++++++++++++------- mm/hugetlb.c | 8 ++++---- mm/internal.h | 2 +- mm/memory.c | 43 +++++++++++++++++++++++++++++-------------- mm/mlock.c | 34 ++++++++++++++++++++++------------ mm/nommu.c | 6 ++++-- 7 files changed, 78 insertions(+), 41 deletions(-) -- 1.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id 455E46B0007 for ; Mon, 4 Feb 2013 02:17:18 -0500 (EST) Received: by mail-pa0-f50.google.com with SMTP id fa11so605348pad.23 for ; Sun, 03 Feb 2013 23:17:17 -0800 (PST) From: Michel Lespinasse Subject: [PATCH v2 1/3] fix mm: use long type for page counts in mm_populate() and get_user_pages() Date: Sun, 3 Feb 2013 23:17:10 -0800 Message-Id: <1359962232-20811-2-git-send-email-walken@google.com> In-Reply-To: <1359962232-20811-1-git-send-email-walken@google.com> References: <1359962232-20811-1-git-send-email-walken@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Andrew suggested I make the nr_pages argument an unsigned long rather than just a long. I was initially worried that nr_pages would be compared with signed longs, but this isn't the case, so his suggestion is perfectly valid. Sending as a 'fix' change, to be collapsed with the original in -mm. Signed-off-by: Michel Lespinasse --- include/linux/hugetlb.h | 2 +- include/linux/mm.h | 11 ++++++----- mm/hugetlb.c | 8 ++++---- mm/memory.c | 12 ++++++------ mm/mlock.c | 2 +- 5 files changed, 18 insertions(+), 17 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index fc6ed17cfd17..eedc334fb6f5 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -45,7 +45,7 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, - unsigned long *, long *, long, unsigned int flags); + unsigned long *, unsigned long *, long, unsigned int); void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, struct page *); void __unmap_hugepage_range_final(struct mmu_gather *tlb, diff --git a/include/linux/mm.h b/include/linux/mm.h index d5716094f191..3d9fbcf9fa94 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1041,12 +1041,13 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, int len, int write); long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long len, unsigned int foll_flags, - struct page **pages, struct vm_area_struct **vmas, - int *nonblocking); + unsigned long start, unsigned long nr_pages, + unsigned int foll_flags, struct page **pages, + struct vm_area_struct **vmas, int *nonblocking); long get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long nr_pages, int write, int force, - struct page **pages, struct vm_area_struct **vmas); + unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, + struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); struct kvec; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4ad07221ce60..951873c8f57e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2926,12 +2926,12 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address, long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, - unsigned long *position, long *length, long i, - unsigned int flags) + unsigned long *position, unsigned long *nr_pages, + long i, unsigned int flags) { unsigned long pfn_offset; unsigned long vaddr = *position; - long remainder = *length; + unsigned long remainder = *nr_pages; struct hstate *h = hstate_vma(vma); spin_lock(&mm->page_table_lock); @@ -3001,7 +3001,7 @@ same_page: } } spin_unlock(&mm->page_table_lock); - *length = remainder; + *nr_pages = remainder; *position = vaddr; return i ? i : -EFAULT; diff --git a/mm/memory.c b/mm/memory.c index 381b78c20d84..f0b6b2b798c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1674,14 +1674,14 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add * you need some special @gup_flags. */ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long nr_pages, unsigned int gup_flags, - struct page **pages, struct vm_area_struct **vmas, - int *nonblocking) + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *nonblocking) { long i; unsigned long vm_flags; - if (nr_pages <= 0) + if (!nr_pages) return 0; VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET)); @@ -1978,8 +1978,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, * See also get_user_pages_fast, for performance critical applications. */ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long nr_pages, int write, int force, - struct page **pages, struct vm_area_struct **vmas) + unsigned long start, unsigned long nr_pages, int write, + int force, struct page **pages, struct vm_area_struct **vmas) { int flags = FOLL_TOUCH; diff --git a/mm/mlock.c b/mm/mlock.c index e1fa9e4b0a66..6baaf4b0e591 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -160,7 +160,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma, { struct mm_struct *mm = vma->vm_mm; unsigned long addr = start; - long nr_pages = (end - start) / PAGE_SIZE; + unsigned long nr_pages = (end - start) / PAGE_SIZE; int gup_flags; VM_BUG_ON(start & ~PAGE_MASK); -- 1.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160]) by kanga.kvack.org (Postfix) with SMTP id 625826B0008 for ; Mon, 4 Feb 2013 02:17:20 -0500 (EST) Received: by mail-da0-f49.google.com with SMTP id v40so2526714dad.8 for ; Sun, 03 Feb 2013 23:17:19 -0800 (PST) From: Michel Lespinasse Subject: [PATCH v2 2/3] mm: accelerate mm_populate() treatment of THP pages Date: Sun, 3 Feb 2013 23:17:11 -0800 Message-Id: <1359962232-20811-3-git-send-email-walken@google.com> In-Reply-To: <1359962232-20811-1-git-send-email-walken@google.com> References: <1359962232-20811-1-git-send-email-walken@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org This change adds a follow_page_mask function which is equivalent to follow_page, but with an extra page_mask argument. follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page, and to 0 in other cases. __get_user_pages() makes use of this in order to accelerate populating THP ranges - that is, when both the pages and vmas arrays are NULL, we don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also avoid taking mm->page_table_lock that many times). Signed-off-by: Michel Lespinasse --- include/linux/mm.h | 13 +++++++++++-- mm/memory.c | 31 +++++++++++++++++++++++-------- mm/nommu.c | 6 ++++-- 3 files changed, 38 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3d9fbcf9fa94..31e4d42002ee 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1636,8 +1636,17 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); -struct page *follow_page(struct vm_area_struct *, unsigned long address, - unsigned int foll_flags); +struct page *follow_page_mask(struct vm_area_struct *vma, + unsigned long address, unsigned int foll_flags, + unsigned int *page_mask); + +static inline struct page *follow_page(struct vm_area_struct *vma, + unsigned long address, unsigned int foll_flags) +{ + unsigned int unused_page_mask; + return follow_page_mask(vma, address, foll_flags, &unused_page_mask); +} + #define FOLL_WRITE 0x01 /* check pte is writable */ #define FOLL_TOUCH 0x02 /* mark page accessed */ #define FOLL_GET 0x04 /* do get_page on page */ diff --git a/mm/memory.c b/mm/memory.c index f0b6b2b798c4..52c8599e7fe4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1458,10 +1458,11 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, EXPORT_SYMBOL_GPL(zap_vma_ptes); /** - * follow_page - look up a page descriptor from a user-virtual address + * follow_page_mask - look up a page descriptor from a user-virtual address * @vma: vm_area_struct mapping @address * @address: virtual address to look up * @flags: flags modifying lookup behaviour + * @page_mask: on output, *page_mask is set according to the size of the page * * @flags can have FOLL_ flags set, defined in * @@ -1469,8 +1470,9 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes); * an error pointer if there is a mapping to something not represented * by a page descriptor (see also vm_normal_page()). */ -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int flags) +struct page *follow_page_mask(struct vm_area_struct *vma, + unsigned long address, unsigned int flags, + unsigned int *page_mask) { pgd_t *pgd; pud_t *pud; @@ -1480,6 +1482,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, struct page *page; struct mm_struct *mm = vma->vm_mm; + *page_mask = 0; + page = follow_huge_addr(mm, address, flags & FOLL_WRITE); if (!IS_ERR(page)) { BUG_ON(flags & FOLL_GET); @@ -1526,6 +1530,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, page = follow_trans_huge_pmd(vma, address, pmd, flags); spin_unlock(&mm->page_table_lock); + *page_mask = HPAGE_PMD_NR - 1; goto out; } } else @@ -1680,6 +1685,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, { long i; unsigned long vm_flags; + unsigned int page_mask; if (!nr_pages) return 0; @@ -1757,6 +1763,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, get_page(page); } pte_unmap(pte); + page_mask = 0; goto next_page; } @@ -1774,6 +1781,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, do { struct page *page; unsigned int foll_flags = gup_flags; + unsigned int page_increm; /* * If we have a pending SIGKILL, don't keep faulting @@ -1783,7 +1791,8 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, return i ? i : -ERESTARTSYS; cond_resched(); - while (!(page = follow_page(vma, start, foll_flags))) { + while (!(page = follow_page_mask(vma, start, + foll_flags, &page_mask))) { int ret; unsigned int fault_flags = 0; @@ -1857,13 +1866,19 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, flush_anon_page(vma, page, start); flush_dcache_page(page); + page_mask = 0; } next_page: - if (vmas) + if (vmas) { vmas[i] = vma; - i++; - start += PAGE_SIZE; - nr_pages--; + page_mask = 0; + } + page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + if (page_increm > nr_pages) + page_increm = nr_pages; + i += page_increm; + start += page_increm * PAGE_SIZE; + nr_pages -= page_increm; } while (nr_pages && start < vma->vm_end); } while (nr_pages); return i; diff --git a/mm/nommu.c b/mm/nommu.c index 429a3d5217fa..9a6a25181267 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1817,9 +1817,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, return ret; } -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int foll_flags) +struct page *follow_page_mask(struct vm_area_struct *vma, + unsigned long address, unsigned int flags, + unsigned int *page_mask) { + *page_mask = 0; return NULL; } -- 1.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx111.postini.com [74.125.245.111]) by kanga.kvack.org (Postfix) with SMTP id 126CD6B0009 for ; Mon, 4 Feb 2013 02:17:22 -0500 (EST) Received: by mail-pa0-f45.google.com with SMTP id kl14so492504pab.4 for ; Sun, 03 Feb 2013 23:17:21 -0800 (PST) From: Michel Lespinasse Subject: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages Date: Sun, 3 Feb 2013 23:17:12 -0800 Message-Id: <1359962232-20811-4-git-send-email-walken@google.com> In-Reply-To: <1359962232-20811-1-git-send-email-walken@google.com> References: <1359962232-20811-1-git-send-email-walken@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE at a time. When munlocking THP pages (or the huge zero page), this resulted in taking the mm->page_table_lock 512 times in a row. We can do better by making use of the page_mask returned by follow_page_mask (for the huge zero page case), or the size of the page munlock_vma_page() operated on (for the true THP page case). Note - I am sending this as RFC only for now as I can't currently put my finger on what if anything prevents split_huge_page() from operating concurrently on the same page as munlock_vma_page(), which would mess up our NR_MLOCK statistics. Is this a latent bug or is there a subtle point I missed here ? Signed-off-by: Michel Lespinasse --- mm/internal.h | 2 +- mm/mlock.c | 32 +++++++++++++++++++++----------- 2 files changed, 22 insertions(+), 12 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 1c0c4cc0fcf7..8562de0a5197 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -195,7 +195,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, * must be called with vma's mmap_sem held for read or write, and page locked. */ extern void mlock_vma_page(struct page *page); -extern void munlock_vma_page(struct page *page); +extern unsigned int munlock_vma_page(struct page *page); /* * Clear the page's PageMlocked(). This can be useful in a situation where diff --git a/mm/mlock.c b/mm/mlock.c index 6baaf4b0e591..486702edee35 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -102,13 +102,14 @@ void mlock_vma_page(struct page *page) * can't isolate the page, we leave it for putback_lru_page() and vmscan * [page_referenced()/try_to_unmap()] to deal with. */ -void munlock_vma_page(struct page *page) +unsigned int munlock_vma_page(struct page *page) { + unsigned int nr_pages = hpage_nr_pages(page); + BUG_ON(!PageLocked(page)); if (TestClearPageMlocked(page)) { - mod_zone_page_state(page_zone(page), NR_MLOCK, - -hpage_nr_pages(page)); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); if (!isolate_lru_page(page)) { int ret = SWAP_AGAIN; @@ -141,6 +142,8 @@ void munlock_vma_page(struct page *page) count_vm_event(UNEVICTABLE_PGMUNLOCKED); } } + + return nr_pages; } /** @@ -159,7 +162,6 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *nonblocking) { struct mm_struct *mm = vma->vm_mm; - unsigned long addr = start; unsigned long nr_pages = (end - start) / PAGE_SIZE; int gup_flags; @@ -185,7 +187,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma, if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) gup_flags |= FOLL_FORCE; - return __get_user_pages(current, mm, addr, nr_pages, gup_flags, + return __get_user_pages(current, mm, start, nr_pages, gup_flags, NULL, NULL, nonblocking); } @@ -222,13 +224,12 @@ static int __mlock_posix_error_return(long retval) void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - unsigned long addr; - - lru_add_drain(); vma->vm_flags &= ~VM_LOCKED; - for (addr = start; addr < end; addr += PAGE_SIZE) { + while (start < end) { struct page *page; + unsigned int page_mask, page_increm; + /* * Although FOLL_DUMP is intended for get_dump_page(), * it just so happens that its special treatment of the @@ -236,13 +237,22 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, * suits munlock very well (and if somehow an abnormal page * has sneaked into the range, we won't oops here: great). */ - page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP); + page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP, + &page_mask); if (page && !IS_ERR(page)) { lock_page(page); - munlock_vma_page(page); + lru_add_drain(); + /* + * Any THP page found by follow_page_mask() may have + * gotten split before reaching munlock_vma_page(), + * so we need to recompute the page_mask here. + */ + page_mask = munlock_vma_page(page); unlock_page(page); put_page(page); } + page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + start += page_increm * PAGE_SIZE; cond_resched(); } } -- 1.8.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id D19836B0005 for ; Wed, 6 Feb 2013 18:44:58 -0500 (EST) Message-ID: <5112EAE8.8070503@oracle.com> Date: Wed, 06 Feb 2013 18:44:40 -0500 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> In-Reply-To: <1359962232-20811-4-git-send-email-walken@google.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org On 02/04/2013 02:17 AM, Michel Lespinasse wrote: > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > at a time. When munlocking THP pages (or the huge zero page), this resulted > in taking the mm->page_table_lock 512 times in a row. > > We can do better by making use of the page_mask returned by follow_page_mask > (for the huge zero page case), or the size of the page munlock_vma_page() > operated on (for the true THP page case). > > Note - I am sending this as RFC only for now as I can't currently put > my finger on what if anything prevents split_huge_page() from operating > concurrently on the same page as munlock_vma_page(), which would mess > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > point I missed here ? > > Signed-off-by: Michel Lespinasse Hi Michel, Fuzzing with trinity inside a KVM tools guest produces a steady stream of: [ 51.823275] ------------[ cut here ]------------ [ 51.823302] kernel BUG at include/linux/page-flags.h:421! [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 51.823307] Dumping ftrace buffer: [ 51.823314] (ftrace buffer empty) [ 51.823314] Modules linked in: [ 51.823314] CPU 2 [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) [ 51.823341] Stack: [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 [ 51.823373] Call Trace: [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 [ 51.823373] [] exit_mmap+0x51/0x170 [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 [ 51.823373] [] mmput+0x77/0xe0 [ 51.823377] [] exit_mm+0x113/0x120 [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 [ 51.823384] [] do_exit+0x24a/0x590 [ 51.823387] [] do_group_exit+0x8a/0xc0 [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 [ 51.823394] [] do_signal+0x42/0x110 [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 [ 51.823412] [] do_notify_resume+0x48/0xa0 [ 51.823415] [] retint_signal+0x4d/0x92 [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 [ 51.823450] RSP [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 7883F6B0005 for ; Wed, 6 Feb 2013 21:50:56 -0500 (EST) Received: from /spool/local by e23smtp04.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 7 Feb 2013 12:42:16 +1000 Received: from d23relay05.au.ibm.com (d23relay05.au.ibm.com [9.190.235.152]) by d23dlp01.au.ibm.com (Postfix) with ESMTP id E99232CE8051 for ; Thu, 7 Feb 2013 13:50:49 +1100 (EST) Received: from d23av02.au.ibm.com (d23av02.au.ibm.com [9.190.235.138]) by d23relay05.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r172cZxL1442302 for ; Thu, 7 Feb 2013 13:38:37 +1100 Received: from d23av02.au.ibm.com (loopback [127.0.0.1]) by d23av02.au.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r172omRu021266 for ; Thu, 7 Feb 2013 13:50:48 +1100 Message-ID: <1360205438.13550.11.camel@ThinkPad-T5421.cn.ibm.com> Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages From: Li Zhong Date: Thu, 07 Feb 2013 10:50:38 +0800 In-Reply-To: <5112EAE8.8070503@oracle.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <5112EAE8.8070503@oracle.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Mime-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Michel Lespinasse , Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote: > On 02/04/2013 02:17 AM, Michel Lespinasse wrote: > > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > > at a time. When munlocking THP pages (or the huge zero page), this resulted > > in taking the mm->page_table_lock 512 times in a row. > > > > We can do better by making use of the page_mask returned by follow_page_mask > > (for the huge zero page case), or the size of the page munlock_vma_page() > > operated on (for the true THP page case). > > > > Note - I am sending this as RFC only for now as I can't currently put > > my finger on what if anything prevents split_huge_page() from operating > > concurrently on the same page as munlock_vma_page(), which would mess > > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > > point I missed here ? > > > > Signed-off-by: Michel Lespinasse > > Hi Michel, > > Fuzzing with trinity inside a KVM tools guest produces a steady stream of: > > > [ 51.823275] ------------[ cut here ]------------ > [ 51.823302] kernel BUG at include/linux/page-flags.h:421! > [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 51.823307] Dumping ftrace buffer: > [ 51.823314] (ftrace buffer empty) > [ 51.823314] Modules linked in: > [ 51.823314] CPU 2 > [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 > [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 > [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 > [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 > [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 > [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 > [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 > [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff > [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 > [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 > [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) > [ 51.823341] Stack: > [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd > [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 > [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 > [ 51.823373] Call Trace: > [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 > [ 51.823373] [] exit_mmap+0x51/0x170 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] mmput+0x77/0xe0 > [ 51.823377] [] exit_mm+0x113/0x120 > [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 > [ 51.823384] [] do_exit+0x24a/0x590 > [ 51.823387] [] do_group_exit+0x8a/0xc0 > [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 > [ 51.823394] [] do_signal+0x42/0x110 > [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 > [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 > [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823412] [] do_notify_resume+0x48/0xa0 > [ 51.823415] [] retint_signal+0x4d/0x92 > [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 > 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b > [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 > [ 51.823450] RSP > [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- > The similar warning prevents my system from booting. And it seems to me that in munlock_vma_pages_range(), the page_mask needs be the page number returned from munlock_vma_page() minus 1. And the following fix solved my problem. Would you please have a try? Thanks, Zhong ================ diff --git a/mm/mlock.c b/mm/mlock.c index af1d115..1e3d794 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -255,7 +255,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, unlock_page(page); put_page(page); } - page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + page_increm = 1 + (~(start >> PAGE_SHIFT) & (page_mask-1)); start += page_increm * PAGE_SIZE; cond_resched(); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id E47086B0005 for ; Thu, 7 Feb 2013 00:42:50 -0500 (EST) Message-ID: <51133EC6.4010902@oracle.com> Date: Thu, 07 Feb 2013 00:42:30 -0500 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <5112EAE8.8070503@oracle.com> <1360205438.13550.11.camel@ThinkPad-T5421.cn.ibm.com> In-Reply-To: <1360205438.13550.11.camel@ThinkPad-T5421.cn.ibm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Li Zhong Cc: Michel Lespinasse , Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org On 02/06/2013 09:50 PM, Li Zhong wrote: > On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote: >> On 02/04/2013 02:17 AM, Michel Lespinasse wrote: >>> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE >>> at a time. When munlocking THP pages (or the huge zero page), this resulted >>> in taking the mm->page_table_lock 512 times in a row. >>> >>> We can do better by making use of the page_mask returned by follow_page_mask >>> (for the huge zero page case), or the size of the page munlock_vma_page() >>> operated on (for the true THP page case). >>> >>> Note - I am sending this as RFC only for now as I can't currently put >>> my finger on what if anything prevents split_huge_page() from operating >>> concurrently on the same page as munlock_vma_page(), which would mess >>> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle >>> point I missed here ? >>> >>> Signed-off-by: Michel Lespinasse >> >> Hi Michel, >> >> Fuzzing with trinity inside a KVM tools guest produces a steady stream of: >> >> >> [ 51.823275] ------------[ cut here ]------------ >> [ 51.823302] kernel BUG at include/linux/page-flags.h:421! >> [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [ 51.823307] Dumping ftrace buffer: >> [ 51.823314] (ftrace buffer empty) >> [ 51.823314] Modules linked in: >> [ 51.823314] CPU 2 >> [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 >> [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 >> [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 >> [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 >> [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 >> [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 >> [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 >> [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff >> [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 >> [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 >> [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) >> [ 51.823341] Stack: >> [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd >> [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 >> [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 >> [ 51.823373] Call Trace: >> [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 >> [ 51.823373] [] exit_mmap+0x51/0x170 >> [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 >> [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 >> [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 >> [ 51.823373] [] mmput+0x77/0xe0 >> [ 51.823377] [] exit_mm+0x113/0x120 >> [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 >> [ 51.823384] [] do_exit+0x24a/0x590 >> [ 51.823387] [] do_group_exit+0x8a/0xc0 >> [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 >> [ 51.823394] [] do_signal+0x42/0x110 >> [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 >> [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 >> [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 >> [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 >> [ 51.823412] [] do_notify_resume+0x48/0xa0 >> [ 51.823415] [] retint_signal+0x4d/0x92 >> [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 >> 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b >> [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 >> [ 51.823450] RSP >> [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- >> > > The similar warning prevents my system from booting. And it seems to me > that in munlock_vma_pages_range(), the page_mask needs be the page > number returned from munlock_vma_page() minus 1. And the following fix > solved my problem. Would you please have a try? Solved it here as well, awesome! Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id 469876B0005 for ; Thu, 7 Feb 2013 06:50:00 -0500 (EST) Received: by mail-ob0-f179.google.com with SMTP id un3so2567706obb.24 for ; Thu, 07 Feb 2013 03:49:59 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <5112EAE8.8070503@oracle.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <5112EAE8.8070503@oracle.com> Date: Thu, 7 Feb 2013 19:49:59 +0800 Message-ID: Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages From: Hillf Danton Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Michel Lespinasse , Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Thu, Feb 7, 2013 at 7:44 AM, Sasha Levin wrote: > On 02/04/2013 02:17 AM, Michel Lespinasse wrote: >> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE >> at a time. When munlocking THP pages (or the huge zero page), this resulted >> in taking the mm->page_table_lock 512 times in a row. >> >> We can do better by making use of the page_mask returned by follow_page_mask >> (for the huge zero page case), or the size of the page munlock_vma_page() >> operated on (for the true THP page case). >> >> Note - I am sending this as RFC only for now as I can't currently put >> my finger on what if anything prevents split_huge_page() from operating >> concurrently on the same page as munlock_vma_page(), which would mess >> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle >> point I missed here ? >> >> Signed-off-by: Michel Lespinasse > > Hi Michel, > > Fuzzing with trinity inside a KVM tools guest produces a steady stream of: > > > [ 51.823275] ------------[ cut here ]------------ > [ 51.823302] kernel BUG at include/linux/page-flags.h:421! > [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 51.823307] Dumping ftrace buffer: > [ 51.823314] (ftrace buffer empty) > [ 51.823314] Modules linked in: > [ 51.823314] CPU 2 > [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 > [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 > [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 > [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 > [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 > [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 > [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 > [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff > [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 > [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 > [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) > [ 51.823341] Stack: > [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd > [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 > [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 > [ 51.823373] Call Trace: > [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 > [ 51.823373] [] exit_mmap+0x51/0x170 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] mmput+0x77/0xe0 > [ 51.823377] [] exit_mm+0x113/0x120 > [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 > [ 51.823384] [] do_exit+0x24a/0x590 > [ 51.823387] [] do_group_exit+0x8a/0xc0 > [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 > [ 51.823394] [] do_signal+0x42/0x110 > [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 > [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 > [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823412] [] do_notify_resume+0x48/0xa0 > [ 51.823415] [] retint_signal+0x4d/0x92 > [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 > 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b > [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 > [ 51.823450] RSP > [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- > > Only is head page mlocked, and we have to avoid checking THP against tail page. Would you please try the following, Sasha? Hillf --- --- a/mm/mlock.c Thu Feb 7 19:43:20 2013 +++ b/mm/mlock.c Thu Feb 7 19:45:54 2013 @@ -104,12 +104,14 @@ void mlock_vma_page(struct page *page) */ unsigned int munlock_vma_page(struct page *page) { - unsigned int nr_pages = hpage_nr_pages(page); + unsigned int nr_pages = 0; BUG_ON(!PageLocked(page)); if (TestClearPageMlocked(page)) { + nr_pages = hpage_nr_pages(page); mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + nr_pages--; if (!isolate_lru_page(page)) { int ret = SWAP_AGAIN; -- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 6866A6B0002 for ; Fri, 8 Feb 2013 15:25:54 -0500 (EST) Date: Fri, 8 Feb 2013 21:25:51 +0100 From: Andrea Arcangeli Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages Message-ID: <20130208202550.GB9817@redhat.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1359962232-20811-4-git-send-email-walken@google.com> Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Hi Michel, On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote: > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > at a time. When munlocking THP pages (or the huge zero page), this resulted > in taking the mm->page_table_lock 512 times in a row. > > We can do better by making use of the page_mask returned by follow_page_mask > (for the huge zero page case), or the size of the page munlock_vma_page() > operated on (for the true THP page case). > > Note - I am sending this as RFC only for now as I can't currently put > my finger on what if anything prevents split_huge_page() from operating > concurrently on the same page as munlock_vma_page(), which would mess > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > point I missed here ? I agree something looks fishy: nor mmap_sem for writing, nor the page lock can stop split_huge_page_refcount. Now the mlock side was intended to be safe because mlock_vma_page is called within follow_page while holding the PT lock or the page_table_lock (so split_huge_page_refcount will have to wait for it to be released before it can run). See follow_trans_huge_pmd assert_spin_locked and the pte_unmap_unlock after mlock_vma_page returns. Problem is, the lock side dependen on the TestSetPageMlocked below to be always repeated on the head page (follow_trans_huge_pmd will always pass the head page to mlock_vma_page). void mlock_vma_page(struct page *page) { BUG_ON(!PageLocked(page)); if (!TestSetPageMlocked(page)) { But what if the head page was split in between two different follow_page calls? The second call wouldn't take the pmd_trans_huge path anymore and the stats would be increased too much. The problem on the munlock side is even more apparent as you pointed out above but now I think the mlock side was problematic too. The good thing is, your accelleration code for the mlock side should have fixed the mlock race already: not ever risking to end up calling mlock_vma_page twice on the head page is not an "accelleration" only, it should also be a natural fix for the race. To fix the munlock side, which is still present, I think one way would be to do mlock and unlock within get_user_pages, so they run in the same place protected by the PT lock or page_table_lock. There are few things that stop split_huge_page_refcount: page_table_lock, lru_lock, compound_lock, anon_vma lock. So if we keep calling munlock_vma_page outside of get_user_pages (so outside of the page_table_lock) the other way would be to use the compound_lock. NOTE: this a purely aesthetical issue in /proc/meminfo, there's nothing functional (at least in the kernel) connected to it, so no panic :). Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id AB9296B0007 for ; Fri, 8 Feb 2013 18:17:06 -0500 (EST) Received: by mail-vb0-f51.google.com with SMTP id fq11so2610532vbb.24 for ; Fri, 08 Feb 2013 15:17:05 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20130208202550.GB9817@redhat.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <20130208202550.GB9817@redhat.com> Date: Fri, 8 Feb 2013 15:17:05 -0800 Message-ID: Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Andrea Arcangeli Cc: Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Fri, Feb 8, 2013 at 12:25 PM, Andrea Arcangeli wrote: > Hi Michel, > > On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote: >> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE >> at a time. When munlocking THP pages (or the huge zero page), this resulted >> in taking the mm->page_table_lock 512 times in a row. >> >> We can do better by making use of the page_mask returned by follow_page_mask >> (for the huge zero page case), or the size of the page munlock_vma_page() >> operated on (for the true THP page case). >> >> Note - I am sending this as RFC only for now as I can't currently put >> my finger on what if anything prevents split_huge_page() from operating >> concurrently on the same page as munlock_vma_page(), which would mess >> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle >> point I missed here ? > > I agree something looks fishy: nor mmap_sem for writing, nor the page > lock can stop split_huge_page_refcount. > > Now the mlock side was intended to be safe because mlock_vma_page is > called within follow_page while holding the PT lock or the > page_table_lock (so split_huge_page_refcount will have to wait for it > to be released before it can run). See follow_trans_huge_pmd > assert_spin_locked and the pte_unmap_unlock after mlock_vma_page > returns. > > Problem is, the lock side dependen on the TestSetPageMlocked below to > be always repeated on the head page (follow_trans_huge_pmd will always > pass the head page to mlock_vma_page). > > void mlock_vma_page(struct page *page) > { > BUG_ON(!PageLocked(page)); > > if (!TestSetPageMlocked(page)) { > > But what if the head page was split in between two different > follow_page calls? The second call wouldn't take the pmd_trans_huge > path anymore and the stats would be increased too much. Ah, good point. I am getting increasingly scared about the locking here the more I look at it :/ > The problem on the munlock side is even more apparent as you pointed > out above but now I think the mlock side was problematic too. > > The good thing is, your accelleration code for the mlock side should > have fixed the mlock race already: not ever risking to end up calling > mlock_vma_page twice on the head page is not an "accelleration" only, > it should also be a natural fix for the race. > > To fix the munlock side, which is still present, I think one way would > be to do mlock and unlock within get_user_pages, so they run in the > same place protected by the PT lock or page_table_lock. I actually had a quick try at this before submitting this patch series; the difficulty with it was that we want to have the page lock before calling munlock_vma_page() and we can't get it while holding the page table lock. So we'd need to do some gymnastics involving trylock_page() and if that fails, release the page table lock / lock the page / reacquire the page table lock / see if we locked the right page. This looked nasty, and then I noticed the existing hpage_nr_pages() test in munlock_vma_page() and decided to just ask you about it :) > There are few things that stop split_huge_page_refcount: > page_table_lock, lru_lock, compound_lock, anon_vma lock. So if we keep > calling munlock_vma_page outside of get_user_pages (so outside of the > page_table_lock) the other way would be to use the compound_lock. > > NOTE: this a purely aesthetical issue in /proc/meminfo, there's > nothing functional (at least in the kernel) connected to it, so no > panic :). Yes. I think we should try and get the locking right, because wrong code is just confusing, and it leads people to make incorrect assumptions and propagate them to places where it may have larger consequences. But other than that, I agree that the statistics issue we're talking about here doesn't sound too severe. I am in a bit of a bind here, as I will be taking a vacation abroad soon. I would like to get these 3 patches out before then, but I don't want to do anything too major as I'm not sure what kind of connectivity I'll have to fix things if needed. I think I'll go with the simplest option of adding the THP optimizaiton first and fixing the locking issues at a later stage... Thanks, -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753869Ab3BDHRS (ORCPT ); Mon, 4 Feb 2013 02:17:18 -0500 Received: from mail-pa0-f43.google.com ([209.85.220.43]:47112 "EHLO mail-pa0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752746Ab3BDHRQ (ORCPT ); Mon, 4 Feb 2013 02:17:16 -0500 From: Michel Lespinasse To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH v2 0/3] fixes for large mm_populate() and munlock() operations Date: Sun, 3 Feb 2013 23:17:09 -0800 Message-Id: <1359962232-20811-1-git-send-email-walken@google.com> X-Mailer: git-send-email 1.8.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org These 3 changes are to improve the handling of large mm_populate and munlock operations. They apply on top of mmotm (in particular, they depend on both my prior mm_populate work and Kirill's "thp: avoid dumping huge zero page" change). - Patch 1 fixes an integer overflow issue when populating 2^32 pages. The nr_pages argument to get_user_pages would overflow, resulting in 0 pages being processed per iteration. I am proposing to simply convert the nr_pages argument to a long. - Patch 2 accelerates populating regions with THP pages. get_user_pages() can increment the address by a huge page size in this case instead of a small page size, and avoid repeated mm->page_table_lock acquisitions. This fixes an issue reported by Roman Dubtsov where populating regions via mmap MAP_POPULATE was significantly slower than doing so by touching pages from userspace. - Patch 3 is a similar acceleration for the munlock case. I would actually like to get Andrea's attention on this one, as I can't explain how munlock_vma_page() is safe against racing with split_huge_page(). Note that patches 1-2 are logically independent of patch 3, so if the discussion of patch 3 takes too long I would ask Andrew to consider merging patches 1-2 first. Changes since v1: - Andrew accepted patch 1 into his -mm tree but suggested the nr_pages argument type should actually be unsigned long; I am sending this as a "fix" for the previous patch 1 to be collapsed over the previous one. - In patch 2, I am adding a separate follow_page_mask() function so that the callers to the original follow_page() don't have to be modified to ignore the returned page_mask (following another suggestion from Andrew). Also the page_mask argument type was changed to unsigned int. - In patch 3, I similarly changed the page_mask values to unsigned int. Michel Lespinasse (3): fix mm: use long type for page counts in mm_populate() and get_user_pages() mm: accelerate mm_populate() treatment of THP pages mm: accelerate munlock() treatment of THP pages include/linux/hugetlb.h | 2 +- include/linux/mm.h | 24 +++++++++++++++++------- mm/hugetlb.c | 8 ++++---- mm/internal.h | 2 +- mm/memory.c | 43 +++++++++++++++++++++++++++++-------------- mm/mlock.c | 34 ++++++++++++++++++++++------------ mm/nommu.c | 6 ++++-- 7 files changed, 78 insertions(+), 41 deletions(-) -- 1.8.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753917Ab3BDHRT (ORCPT ); Mon, 4 Feb 2013 02:17:19 -0500 Received: from mail-da0-f43.google.com ([209.85.210.43]:38344 "EHLO mail-da0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753593Ab3BDHRS (ORCPT ); Mon, 4 Feb 2013 02:17:18 -0500 From: Michel Lespinasse To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH v2 1/3] fix mm: use long type for page counts in mm_populate() and get_user_pages() Date: Sun, 3 Feb 2013 23:17:10 -0800 Message-Id: <1359962232-20811-2-git-send-email-walken@google.com> X-Mailer: git-send-email 1.8.1 In-Reply-To: <1359962232-20811-1-git-send-email-walken@google.com> References: <1359962232-20811-1-git-send-email-walken@google.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andrew suggested I make the nr_pages argument an unsigned long rather than just a long. I was initially worried that nr_pages would be compared with signed longs, but this isn't the case, so his suggestion is perfectly valid. Sending as a 'fix' change, to be collapsed with the original in -mm. Signed-off-by: Michel Lespinasse --- include/linux/hugetlb.h | 2 +- include/linux/mm.h | 11 ++++++----- mm/hugetlb.c | 8 ++++---- mm/memory.c | 12 ++++++------ mm/mlock.c | 2 +- 5 files changed, 18 insertions(+), 17 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index fc6ed17cfd17..eedc334fb6f5 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -45,7 +45,7 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, - unsigned long *, long *, long, unsigned int flags); + unsigned long *, unsigned long *, long, unsigned int); void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, struct page *); void __unmap_hugepage_range_final(struct mmu_gather *tlb, diff --git a/include/linux/mm.h b/include/linux/mm.h index d5716094f191..3d9fbcf9fa94 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1041,12 +1041,13 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf, int len, int write); long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long len, unsigned int foll_flags, - struct page **pages, struct vm_area_struct **vmas, - int *nonblocking); + unsigned long start, unsigned long nr_pages, + unsigned int foll_flags, struct page **pages, + struct vm_area_struct **vmas, int *nonblocking); long get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long nr_pages, int write, int force, - struct page **pages, struct vm_area_struct **vmas); + unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, + struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); struct kvec; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4ad07221ce60..951873c8f57e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2926,12 +2926,12 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address, long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, - unsigned long *position, long *length, long i, - unsigned int flags) + unsigned long *position, unsigned long *nr_pages, + long i, unsigned int flags) { unsigned long pfn_offset; unsigned long vaddr = *position; - long remainder = *length; + unsigned long remainder = *nr_pages; struct hstate *h = hstate_vma(vma); spin_lock(&mm->page_table_lock); @@ -3001,7 +3001,7 @@ same_page: } } spin_unlock(&mm->page_table_lock); - *length = remainder; + *nr_pages = remainder; *position = vaddr; return i ? i : -EFAULT; diff --git a/mm/memory.c b/mm/memory.c index 381b78c20d84..f0b6b2b798c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1674,14 +1674,14 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add * you need some special @gup_flags. */ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long nr_pages, unsigned int gup_flags, - struct page **pages, struct vm_area_struct **vmas, - int *nonblocking) + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *nonblocking) { long i; unsigned long vm_flags; - if (nr_pages <= 0) + if (!nr_pages) return 0; VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET)); @@ -1978,8 +1978,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, * See also get_user_pages_fast, for performance critical applications. */ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, long nr_pages, int write, int force, - struct page **pages, struct vm_area_struct **vmas) + unsigned long start, unsigned long nr_pages, int write, + int force, struct page **pages, struct vm_area_struct **vmas) { int flags = FOLL_TOUCH; diff --git a/mm/mlock.c b/mm/mlock.c index e1fa9e4b0a66..6baaf4b0e591 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -160,7 +160,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma, { struct mm_struct *mm = vma->vm_mm; unsigned long addr = start; - long nr_pages = (end - start) / PAGE_SIZE; + unsigned long nr_pages = (end - start) / PAGE_SIZE; int gup_flags; VM_BUG_ON(start & ~PAGE_MASK); -- 1.8.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753985Ab3BDHR2 (ORCPT ); Mon, 4 Feb 2013 02:17:28 -0500 Received: from mail-da0-f44.google.com ([209.85.210.44]:60593 "EHLO mail-da0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753176Ab3BDHRW (ORCPT ); Mon, 4 Feb 2013 02:17:22 -0500 From: Michel Lespinasse To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages Date: Sun, 3 Feb 2013 23:17:12 -0800 Message-Id: <1359962232-20811-4-git-send-email-walken@google.com> X-Mailer: git-send-email 1.8.1 In-Reply-To: <1359962232-20811-1-git-send-email-walken@google.com> References: <1359962232-20811-1-git-send-email-walken@google.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE at a time. When munlocking THP pages (or the huge zero page), this resulted in taking the mm->page_table_lock 512 times in a row. We can do better by making use of the page_mask returned by follow_page_mask (for the huge zero page case), or the size of the page munlock_vma_page() operated on (for the true THP page case). Note - I am sending this as RFC only for now as I can't currently put my finger on what if anything prevents split_huge_page() from operating concurrently on the same page as munlock_vma_page(), which would mess up our NR_MLOCK statistics. Is this a latent bug or is there a subtle point I missed here ? Signed-off-by: Michel Lespinasse --- mm/internal.h | 2 +- mm/mlock.c | 32 +++++++++++++++++++++----------- 2 files changed, 22 insertions(+), 12 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 1c0c4cc0fcf7..8562de0a5197 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -195,7 +195,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, * must be called with vma's mmap_sem held for read or write, and page locked. */ extern void mlock_vma_page(struct page *page); -extern void munlock_vma_page(struct page *page); +extern unsigned int munlock_vma_page(struct page *page); /* * Clear the page's PageMlocked(). This can be useful in a situation where diff --git a/mm/mlock.c b/mm/mlock.c index 6baaf4b0e591..486702edee35 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -102,13 +102,14 @@ void mlock_vma_page(struct page *page) * can't isolate the page, we leave it for putback_lru_page() and vmscan * [page_referenced()/try_to_unmap()] to deal with. */ -void munlock_vma_page(struct page *page) +unsigned int munlock_vma_page(struct page *page) { + unsigned int nr_pages = hpage_nr_pages(page); + BUG_ON(!PageLocked(page)); if (TestClearPageMlocked(page)) { - mod_zone_page_state(page_zone(page), NR_MLOCK, - -hpage_nr_pages(page)); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); if (!isolate_lru_page(page)) { int ret = SWAP_AGAIN; @@ -141,6 +142,8 @@ void munlock_vma_page(struct page *page) count_vm_event(UNEVICTABLE_PGMUNLOCKED); } } + + return nr_pages; } /** @@ -159,7 +162,6 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *nonblocking) { struct mm_struct *mm = vma->vm_mm; - unsigned long addr = start; unsigned long nr_pages = (end - start) / PAGE_SIZE; int gup_flags; @@ -185,7 +187,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma, if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) gup_flags |= FOLL_FORCE; - return __get_user_pages(current, mm, addr, nr_pages, gup_flags, + return __get_user_pages(current, mm, start, nr_pages, gup_flags, NULL, NULL, nonblocking); } @@ -222,13 +224,12 @@ static int __mlock_posix_error_return(long retval) void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - unsigned long addr; - - lru_add_drain(); vma->vm_flags &= ~VM_LOCKED; - for (addr = start; addr < end; addr += PAGE_SIZE) { + while (start < end) { struct page *page; + unsigned int page_mask, page_increm; + /* * Although FOLL_DUMP is intended for get_dump_page(), * it just so happens that its special treatment of the @@ -236,13 +237,22 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, * suits munlock very well (and if somehow an abnormal page * has sneaked into the range, we won't oops here: great). */ - page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP); + page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP, + &page_mask); if (page && !IS_ERR(page)) { lock_page(page); - munlock_vma_page(page); + lru_add_drain(); + /* + * Any THP page found by follow_page_mask() may have + * gotten split before reaching munlock_vma_page(), + * so we need to recompute the page_mask here. + */ + page_mask = munlock_vma_page(page); unlock_page(page); put_page(page); } + page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + start += page_increm * PAGE_SIZE; cond_resched(); } } -- 1.8.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753946Ab3BDHRX (ORCPT ); Mon, 4 Feb 2013 02:17:23 -0500 Received: from mail-da0-f52.google.com ([209.85.210.52]:55007 "EHLO mail-da0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753932Ab3BDHRU (ORCPT ); Mon, 4 Feb 2013 02:17:20 -0500 From: Michel Lespinasse To: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH v2 2/3] mm: accelerate mm_populate() treatment of THP pages Date: Sun, 3 Feb 2013 23:17:11 -0800 Message-Id: <1359962232-20811-3-git-send-email-walken@google.com> X-Mailer: git-send-email 1.8.1 In-Reply-To: <1359962232-20811-1-git-send-email-walken@google.com> References: <1359962232-20811-1-git-send-email-walken@google.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This change adds a follow_page_mask function which is equivalent to follow_page, but with an extra page_mask argument. follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page, and to 0 in other cases. __get_user_pages() makes use of this in order to accelerate populating THP ranges - that is, when both the pages and vmas arrays are NULL, we don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also avoid taking mm->page_table_lock that many times). Signed-off-by: Michel Lespinasse --- include/linux/mm.h | 13 +++++++++++-- mm/memory.c | 31 +++++++++++++++++++++++-------- mm/nommu.c | 6 ++++-- 3 files changed, 38 insertions(+), 12 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3d9fbcf9fa94..31e4d42002ee 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1636,8 +1636,17 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); -struct page *follow_page(struct vm_area_struct *, unsigned long address, - unsigned int foll_flags); +struct page *follow_page_mask(struct vm_area_struct *vma, + unsigned long address, unsigned int foll_flags, + unsigned int *page_mask); + +static inline struct page *follow_page(struct vm_area_struct *vma, + unsigned long address, unsigned int foll_flags) +{ + unsigned int unused_page_mask; + return follow_page_mask(vma, address, foll_flags, &unused_page_mask); +} + #define FOLL_WRITE 0x01 /* check pte is writable */ #define FOLL_TOUCH 0x02 /* mark page accessed */ #define FOLL_GET 0x04 /* do get_page on page */ diff --git a/mm/memory.c b/mm/memory.c index f0b6b2b798c4..52c8599e7fe4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1458,10 +1458,11 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, EXPORT_SYMBOL_GPL(zap_vma_ptes); /** - * follow_page - look up a page descriptor from a user-virtual address + * follow_page_mask - look up a page descriptor from a user-virtual address * @vma: vm_area_struct mapping @address * @address: virtual address to look up * @flags: flags modifying lookup behaviour + * @page_mask: on output, *page_mask is set according to the size of the page * * @flags can have FOLL_ flags set, defined in * @@ -1469,8 +1470,9 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes); * an error pointer if there is a mapping to something not represented * by a page descriptor (see also vm_normal_page()). */ -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int flags) +struct page *follow_page_mask(struct vm_area_struct *vma, + unsigned long address, unsigned int flags, + unsigned int *page_mask) { pgd_t *pgd; pud_t *pud; @@ -1480,6 +1482,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, struct page *page; struct mm_struct *mm = vma->vm_mm; + *page_mask = 0; + page = follow_huge_addr(mm, address, flags & FOLL_WRITE); if (!IS_ERR(page)) { BUG_ON(flags & FOLL_GET); @@ -1526,6 +1530,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, page = follow_trans_huge_pmd(vma, address, pmd, flags); spin_unlock(&mm->page_table_lock); + *page_mask = HPAGE_PMD_NR - 1; goto out; } } else @@ -1680,6 +1685,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, { long i; unsigned long vm_flags; + unsigned int page_mask; if (!nr_pages) return 0; @@ -1757,6 +1763,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, get_page(page); } pte_unmap(pte); + page_mask = 0; goto next_page; } @@ -1774,6 +1781,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, do { struct page *page; unsigned int foll_flags = gup_flags; + unsigned int page_increm; /* * If we have a pending SIGKILL, don't keep faulting @@ -1783,7 +1791,8 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, return i ? i : -ERESTARTSYS; cond_resched(); - while (!(page = follow_page(vma, start, foll_flags))) { + while (!(page = follow_page_mask(vma, start, + foll_flags, &page_mask))) { int ret; unsigned int fault_flags = 0; @@ -1857,13 +1866,19 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, flush_anon_page(vma, page, start); flush_dcache_page(page); + page_mask = 0; } next_page: - if (vmas) + if (vmas) { vmas[i] = vma; - i++; - start += PAGE_SIZE; - nr_pages--; + page_mask = 0; + } + page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + if (page_increm > nr_pages) + page_increm = nr_pages; + i += page_increm; + start += page_increm * PAGE_SIZE; + nr_pages -= page_increm; } while (nr_pages && start < vma->vm_end); } while (nr_pages); return i; diff --git a/mm/nommu.c b/mm/nommu.c index 429a3d5217fa..9a6a25181267 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1817,9 +1817,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, return ret; } -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int foll_flags) +struct page *follow_page_mask(struct vm_area_struct *vma, + unsigned long address, unsigned int flags, + unsigned int *page_mask) { + *page_mask = 0; return NULL; } -- 1.8.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758407Ab3BFXpq (ORCPT ); Wed, 6 Feb 2013 18:45:46 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:40576 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755489Ab3BFXpp (ORCPT ); Wed, 6 Feb 2013 18:45:45 -0500 Message-ID: <5112EAE8.8070503@oracle.com> Date: Wed, 06 Feb 2013 18:44:40 -0500 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130113 Thunderbird/17.0.2 MIME-Version: 1.0 To: Michel Lespinasse CC: Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> In-Reply-To: <1359962232-20811-4-git-send-email-walken@google.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet22.oracle.com [156.151.31.94] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/04/2013 02:17 AM, Michel Lespinasse wrote: > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > at a time. When munlocking THP pages (or the huge zero page), this resulted > in taking the mm->page_table_lock 512 times in a row. > > We can do better by making use of the page_mask returned by follow_page_mask > (for the huge zero page case), or the size of the page munlock_vma_page() > operated on (for the true THP page case). > > Note - I am sending this as RFC only for now as I can't currently put > my finger on what if anything prevents split_huge_page() from operating > concurrently on the same page as munlock_vma_page(), which would mess > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > point I missed here ? > > Signed-off-by: Michel Lespinasse Hi Michel, Fuzzing with trinity inside a KVM tools guest produces a steady stream of: [ 51.823275] ------------[ cut here ]------------ [ 51.823302] kernel BUG at include/linux/page-flags.h:421! [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 51.823307] Dumping ftrace buffer: [ 51.823314] (ftrace buffer empty) [ 51.823314] Modules linked in: [ 51.823314] CPU 2 [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) [ 51.823341] Stack: [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 [ 51.823373] Call Trace: [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 [ 51.823373] [] exit_mmap+0x51/0x170 [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 [ 51.823373] [] mmput+0x77/0xe0 [ 51.823377] [] exit_mm+0x113/0x120 [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 [ 51.823384] [] do_exit+0x24a/0x590 [ 51.823387] [] do_group_exit+0x8a/0xc0 [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 [ 51.823394] [] do_signal+0x42/0x110 [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 [ 51.823412] [] do_notify_resume+0x48/0xa0 [ 51.823415] [] retint_signal+0x4d/0x92 [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 [ 51.823450] RSP [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758000Ab3BGCu7 (ORCPT ); Wed, 6 Feb 2013 21:50:59 -0500 Received: from e23smtp09.au.ibm.com ([202.81.31.142]:60546 "EHLO e23smtp09.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757607Ab3BGCu6 (ORCPT ); Wed, 6 Feb 2013 21:50:58 -0500 Message-ID: <1360205438.13550.11.camel@ThinkPad-T5421.cn.ibm.com> Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages From: Li Zhong To: Sasha Levin Cc: Michel Lespinasse , Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Date: Thu, 07 Feb 2013 10:50:38 +0800 In-Reply-To: <5112EAE8.8070503@oracle.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <5112EAE8.8070503@oracle.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.2.3-0ubuntu6 Content-Transfer-Encoding: 7bit Mime-Version: 1.0 X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13020702-3568-0000-0000-0000031FBB4B Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote: > On 02/04/2013 02:17 AM, Michel Lespinasse wrote: > > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > > at a time. When munlocking THP pages (or the huge zero page), this resulted > > in taking the mm->page_table_lock 512 times in a row. > > > > We can do better by making use of the page_mask returned by follow_page_mask > > (for the huge zero page case), or the size of the page munlock_vma_page() > > operated on (for the true THP page case). > > > > Note - I am sending this as RFC only for now as I can't currently put > > my finger on what if anything prevents split_huge_page() from operating > > concurrently on the same page as munlock_vma_page(), which would mess > > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > > point I missed here ? > > > > Signed-off-by: Michel Lespinasse > > Hi Michel, > > Fuzzing with trinity inside a KVM tools guest produces a steady stream of: > > > [ 51.823275] ------------[ cut here ]------------ > [ 51.823302] kernel BUG at include/linux/page-flags.h:421! > [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 51.823307] Dumping ftrace buffer: > [ 51.823314] (ftrace buffer empty) > [ 51.823314] Modules linked in: > [ 51.823314] CPU 2 > [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 > [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 > [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 > [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 > [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 > [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 > [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 > [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff > [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 > [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 > [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) > [ 51.823341] Stack: > [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd > [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 > [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 > [ 51.823373] Call Trace: > [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 > [ 51.823373] [] exit_mmap+0x51/0x170 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] mmput+0x77/0xe0 > [ 51.823377] [] exit_mm+0x113/0x120 > [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 > [ 51.823384] [] do_exit+0x24a/0x590 > [ 51.823387] [] do_group_exit+0x8a/0xc0 > [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 > [ 51.823394] [] do_signal+0x42/0x110 > [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 > [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 > [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823412] [] do_notify_resume+0x48/0xa0 > [ 51.823415] [] retint_signal+0x4d/0x92 > [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 > 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b > [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 > [ 51.823450] RSP > [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- > The similar warning prevents my system from booting. And it seems to me that in munlock_vma_pages_range(), the page_mask needs be the page number returned from munlock_vma_page() minus 1. And the following fix solved my problem. Would you please have a try? Thanks, Zhong ================ diff --git a/mm/mlock.c b/mm/mlock.c index af1d115..1e3d794 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -255,7 +255,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, unlock_page(page); put_page(page); } - page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); + page_increm = 1 + (~(start >> PAGE_SHIFT) & (page_mask-1)); start += page_increm * PAGE_SIZE; cond_resched(); } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751329Ab3BGFnj (ORCPT ); Thu, 7 Feb 2013 00:43:39 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:37069 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750724Ab3BGFni (ORCPT ); Thu, 7 Feb 2013 00:43:38 -0500 Message-ID: <51133EC6.4010902@oracle.com> Date: Thu, 07 Feb 2013 00:42:30 -0500 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130113 Thunderbird/17.0.2 MIME-Version: 1.0 To: Li Zhong CC: Michel Lespinasse , Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <5112EAE8.8070503@oracle.com> <1360205438.13550.11.camel@ThinkPad-T5421.cn.ibm.com> In-Reply-To: <1360205438.13550.11.camel@ThinkPad-T5421.cn.ibm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Source-IP: acsinet21.oracle.com [141.146.126.237] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/06/2013 09:50 PM, Li Zhong wrote: > On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote: >> On 02/04/2013 02:17 AM, Michel Lespinasse wrote: >>> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE >>> at a time. When munlocking THP pages (or the huge zero page), this resulted >>> in taking the mm->page_table_lock 512 times in a row. >>> >>> We can do better by making use of the page_mask returned by follow_page_mask >>> (for the huge zero page case), or the size of the page munlock_vma_page() >>> operated on (for the true THP page case). >>> >>> Note - I am sending this as RFC only for now as I can't currently put >>> my finger on what if anything prevents split_huge_page() from operating >>> concurrently on the same page as munlock_vma_page(), which would mess >>> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle >>> point I missed here ? >>> >>> Signed-off-by: Michel Lespinasse >> >> Hi Michel, >> >> Fuzzing with trinity inside a KVM tools guest produces a steady stream of: >> >> >> [ 51.823275] ------------[ cut here ]------------ >> [ 51.823302] kernel BUG at include/linux/page-flags.h:421! >> [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [ 51.823307] Dumping ftrace buffer: >> [ 51.823314] (ftrace buffer empty) >> [ 51.823314] Modules linked in: >> [ 51.823314] CPU 2 >> [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 >> [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 >> [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 >> [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 >> [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 >> [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 >> [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 >> [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff >> [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 >> [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 >> [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) >> [ 51.823341] Stack: >> [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd >> [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 >> [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 >> [ 51.823373] Call Trace: >> [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 >> [ 51.823373] [] exit_mmap+0x51/0x170 >> [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 >> [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 >> [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 >> [ 51.823373] [] mmput+0x77/0xe0 >> [ 51.823377] [] exit_mm+0x113/0x120 >> [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 >> [ 51.823384] [] do_exit+0x24a/0x590 >> [ 51.823387] [] do_group_exit+0x8a/0xc0 >> [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 >> [ 51.823394] [] do_signal+0x42/0x110 >> [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 >> [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 >> [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 >> [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 >> [ 51.823412] [] do_notify_resume+0x48/0xa0 >> [ 51.823415] [] retint_signal+0x4d/0x92 >> [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 >> 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b >> [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 >> [ 51.823450] RSP >> [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- >> > > The similar warning prevents my system from booting. And it seems to me > that in munlock_vma_pages_range(), the page_mask needs be the page > number returned from munlock_vma_page() minus 1. And the following fix > solved my problem. Would you please have a try? Solved it here as well, awesome! Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758068Ab3BGLuB (ORCPT ); Thu, 7 Feb 2013 06:50:01 -0500 Received: from mail-oa0-f54.google.com ([209.85.219.54]:61754 "EHLO mail-oa0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756024Ab3BGLt7 (ORCPT ); Thu, 7 Feb 2013 06:49:59 -0500 MIME-Version: 1.0 In-Reply-To: <5112EAE8.8070503@oracle.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <5112EAE8.8070503@oracle.com> Date: Thu, 7 Feb 2013 19:49:59 +0800 Message-ID: Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages From: Hillf Danton To: Sasha Levin Cc: Michel Lespinasse , Andrea Arcangeli , Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 7, 2013 at 7:44 AM, Sasha Levin wrote: > On 02/04/2013 02:17 AM, Michel Lespinasse wrote: >> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE >> at a time. When munlocking THP pages (or the huge zero page), this resulted >> in taking the mm->page_table_lock 512 times in a row. >> >> We can do better by making use of the page_mask returned by follow_page_mask >> (for the huge zero page case), or the size of the page munlock_vma_page() >> operated on (for the true THP page case). >> >> Note - I am sending this as RFC only for now as I can't currently put >> my finger on what if anything prevents split_huge_page() from operating >> concurrently on the same page as munlock_vma_page(), which would mess >> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle >> point I missed here ? >> >> Signed-off-by: Michel Lespinasse > > Hi Michel, > > Fuzzing with trinity inside a KVM tools guest produces a steady stream of: > > > [ 51.823275] ------------[ cut here ]------------ > [ 51.823302] kernel BUG at include/linux/page-flags.h:421! > [ 51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 51.823307] Dumping ftrace buffer: > [ 51.823314] (ftrace buffer empty) > [ 51.823314] Modules linked in: > [ 51.823314] CPU 2 > [ 51.823314] Pid: 7116, comm: trinity Tainted: G W 3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273 > [ 51.823316] RIP: 0010:[] [] munlock_vma_page+0x12/0xf0 > [ 51.823317] RSP: 0018:ffff880009641bb8 EFLAGS: 00010282 > [ 51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000 > [ 51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040 > [ 51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000 > [ 51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958 > [ 51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff > [ 51.823326] FS: 00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000 > [ 51.823327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0 > [ 51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000) > [ 51.823341] Stack: > [ 51.823344] 0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd > [ 51.823373] ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958 > [ 51.823373] ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81 > [ 51.823373] Call Trace: > [ 51.823373] [] munlock_vma_pages_range+0x8d/0xf0 > [ 51.823373] [] exit_mmap+0x51/0x170 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] ? kmem_cache_free+0x22f/0x3b0 > [ 51.823373] [] ? __khugepaged_exit+0x8a/0xf0 > [ 51.823373] [] mmput+0x77/0xe0 > [ 51.823377] [] exit_mm+0x113/0x120 > [ 51.823381] [] ? _raw_spin_unlock_irq+0x51/0x80 > [ 51.823384] [] do_exit+0x24a/0x590 > [ 51.823387] [] do_group_exit+0x8a/0xc0 > [ 51.823390] [] get_signal_to_deliver+0x501/0x5b0 > [ 51.823394] [] do_signal+0x42/0x110 > [ 51.823399] [] ? rcu_eqs_exit_common+0x64/0x340 > [ 51.823404] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823407] [] ? trace_hardirqs_on_caller+0x128/0x160 > [ 51.823409] [] ? trace_hardirqs_on+0xd/0x10 > [ 51.823412] [] do_notify_resume+0x48/0xa0 > [ 51.823415] [] retint_signal+0x4d/0x92 > [ 51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48 > 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b > [ 51.823449] RIP [] munlock_vma_page+0x12/0xf0 > [ 51.823450] RSP > [ 51.826846] ---[ end trace a7919e7f17c0a72a ]--- > > Only is head page mlocked, and we have to avoid checking THP against tail page. Would you please try the following, Sasha? Hillf --- --- a/mm/mlock.c Thu Feb 7 19:43:20 2013 +++ b/mm/mlock.c Thu Feb 7 19:45:54 2013 @@ -104,12 +104,14 @@ void mlock_vma_page(struct page *page) */ unsigned int munlock_vma_page(struct page *page) { - unsigned int nr_pages = hpage_nr_pages(page); + unsigned int nr_pages = 0; BUG_ON(!PageLocked(page)); if (TestClearPageMlocked(page)) { + nr_pages = hpage_nr_pages(page); mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + nr_pages--; if (!isolate_lru_page(page)) { int ret = SWAP_AGAIN; -- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1947026Ab3BHU0i (ORCPT ); Fri, 8 Feb 2013 15:26:38 -0500 Received: from mx1.redhat.com ([209.132.183.28]:26133 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1946902Ab3BHU0h (ORCPT ); Fri, 8 Feb 2013 15:26:37 -0500 Date: Fri, 8 Feb 2013 21:25:51 +0100 From: Andrea Arcangeli To: Michel Lespinasse Cc: Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages Message-ID: <20130208202550.GB9817@redhat.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1359962232-20811-4-git-send-email-walken@google.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Michel, On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote: > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE > at a time. When munlocking THP pages (or the huge zero page), this resulted > in taking the mm->page_table_lock 512 times in a row. > > We can do better by making use of the page_mask returned by follow_page_mask > (for the huge zero page case), or the size of the page munlock_vma_page() > operated on (for the true THP page case). > > Note - I am sending this as RFC only for now as I can't currently put > my finger on what if anything prevents split_huge_page() from operating > concurrently on the same page as munlock_vma_page(), which would mess > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle > point I missed here ? I agree something looks fishy: nor mmap_sem for writing, nor the page lock can stop split_huge_page_refcount. Now the mlock side was intended to be safe because mlock_vma_page is called within follow_page while holding the PT lock or the page_table_lock (so split_huge_page_refcount will have to wait for it to be released before it can run). See follow_trans_huge_pmd assert_spin_locked and the pte_unmap_unlock after mlock_vma_page returns. Problem is, the lock side dependen on the TestSetPageMlocked below to be always repeated on the head page (follow_trans_huge_pmd will always pass the head page to mlock_vma_page). void mlock_vma_page(struct page *page) { BUG_ON(!PageLocked(page)); if (!TestSetPageMlocked(page)) { But what if the head page was split in between two different follow_page calls? The second call wouldn't take the pmd_trans_huge path anymore and the stats would be increased too much. The problem on the munlock side is even more apparent as you pointed out above but now I think the mlock side was problematic too. The good thing is, your accelleration code for the mlock side should have fixed the mlock race already: not ever risking to end up calling mlock_vma_page twice on the head page is not an "accelleration" only, it should also be a natural fix for the race. To fix the munlock side, which is still present, I think one way would be to do mlock and unlock within get_user_pages, so they run in the same place protected by the PT lock or page_table_lock. There are few things that stop split_huge_page_refcount: page_table_lock, lru_lock, compound_lock, anon_vma lock. So if we keep calling munlock_vma_page outside of get_user_pages (so outside of the page_table_lock) the other way would be to use the compound_lock. NOTE: this a purely aesthetical issue in /proc/meminfo, there's nothing functional (at least in the kernel) connected to it, so no panic :). Thanks, Andrea From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1947322Ab3BHXRJ (ORCPT ); Fri, 8 Feb 2013 18:17:09 -0500 Received: from mail-ve0-f177.google.com ([209.85.128.177]:59324 "EHLO mail-ve0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757205Ab3BHXRH (ORCPT ); Fri, 8 Feb 2013 18:17:07 -0500 MIME-Version: 1.0 In-Reply-To: <20130208202550.GB9817@redhat.com> References: <1359962232-20811-1-git-send-email-walken@google.com> <1359962232-20811-4-git-send-email-walken@google.com> <20130208202550.GB9817@redhat.com> Date: Fri, 8 Feb 2013 15:17:05 -0800 Message-ID: Subject: Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages From: Michel Lespinasse To: Andrea Arcangeli Cc: Rik van Riel , Mel Gorman , Hugh Dickins , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 8, 2013 at 12:25 PM, Andrea Arcangeli wrote: > Hi Michel, > > On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote: >> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE >> at a time. When munlocking THP pages (or the huge zero page), this resulted >> in taking the mm->page_table_lock 512 times in a row. >> >> We can do better by making use of the page_mask returned by follow_page_mask >> (for the huge zero page case), or the size of the page munlock_vma_page() >> operated on (for the true THP page case). >> >> Note - I am sending this as RFC only for now as I can't currently put >> my finger on what if anything prevents split_huge_page() from operating >> concurrently on the same page as munlock_vma_page(), which would mess >> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle >> point I missed here ? > > I agree something looks fishy: nor mmap_sem for writing, nor the page > lock can stop split_huge_page_refcount. > > Now the mlock side was intended to be safe because mlock_vma_page is > called within follow_page while holding the PT lock or the > page_table_lock (so split_huge_page_refcount will have to wait for it > to be released before it can run). See follow_trans_huge_pmd > assert_spin_locked and the pte_unmap_unlock after mlock_vma_page > returns. > > Problem is, the lock side dependen on the TestSetPageMlocked below to > be always repeated on the head page (follow_trans_huge_pmd will always > pass the head page to mlock_vma_page). > > void mlock_vma_page(struct page *page) > { > BUG_ON(!PageLocked(page)); > > if (!TestSetPageMlocked(page)) { > > But what if the head page was split in between two different > follow_page calls? The second call wouldn't take the pmd_trans_huge > path anymore and the stats would be increased too much. Ah, good point. I am getting increasingly scared about the locking here the more I look at it :/ > The problem on the munlock side is even more apparent as you pointed > out above but now I think the mlock side was problematic too. > > The good thing is, your accelleration code for the mlock side should > have fixed the mlock race already: not ever risking to end up calling > mlock_vma_page twice on the head page is not an "accelleration" only, > it should also be a natural fix for the race. > > To fix the munlock side, which is still present, I think one way would > be to do mlock and unlock within get_user_pages, so they run in the > same place protected by the PT lock or page_table_lock. I actually had a quick try at this before submitting this patch series; the difficulty with it was that we want to have the page lock before calling munlock_vma_page() and we can't get it while holding the page table lock. So we'd need to do some gymnastics involving trylock_page() and if that fails, release the page table lock / lock the page / reacquire the page table lock / see if we locked the right page. This looked nasty, and then I noticed the existing hpage_nr_pages() test in munlock_vma_page() and decided to just ask you about it :) > There are few things that stop split_huge_page_refcount: > page_table_lock, lru_lock, compound_lock, anon_vma lock. So if we keep > calling munlock_vma_page outside of get_user_pages (so outside of the > page_table_lock) the other way would be to use the compound_lock. > > NOTE: this a purely aesthetical issue in /proc/meminfo, there's > nothing functional (at least in the kernel) connected to it, so no > panic :). Yes. I think we should try and get the locking right, because wrong code is just confusing, and it leads people to make incorrect assumptions and propagate them to places where it may have larger consequences. But other than that, I agree that the statistics issue we're talking about here doesn't sound too severe. I am in a bit of a bind here, as I will be taking a vacation abroad soon. I would like to get these 3 patches out before then, but I don't want to do anything too major as I'm not sure what kind of connectivity I'll have to fix things if needed. I think I'll go with the simplest option of adding the THP optimizaiton first and fixing the locking issues at a later stage... Thanks, -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies.