From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id 0729E6B003A for ; Tue, 1 Jul 2014 13:07:46 -0400 (EDT) Received: by mail-wi0-f171.google.com with SMTP id n15so8232879wiw.4 for ; Tue, 01 Jul 2014 10:07:46 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id u10si28766906wjr.88.2014.07.01.10.07.44 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:45 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 00/13] pagewalk: improve vma handling, apply to new users Date: Tue, 1 Jul 2014 13:07:18 -0400 Message-Id: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi This series is version 4 of the page table walker patchset. It reflects the comments on the previous version (thanks Kirill, Jerome) and is rebased onto v3.16-rc3. The code around queue_pages_range() was recently changed by commit d05f0cdcbe63 "mm: fix crashes from mbind() merging vma", which affects this series a little. Thanks, Naoya Horiguchi Tree: git@github.com:Naoya-Horiguchi/linux.git Branch: v3.16-rc3/page_table_walker.ver4 --- Summary: Kirill A. 
Shutemov (1): mm: /proc/pid/clear_refs: avoid split_huge_page() Naoya Horiguchi (12): mm/pagewalk: remove pgd_entry() and pud_entry() pagewalk: improve vma handling pagewalk: add walk_page_vma() smaps: remove mem_size_stats->vma and use walk_page_vma() clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() pagemap: use walk->vma instead of calling find_vma() numa_maps: fix typo in gather_hugetbl_stats numa_maps: remove numa_maps->vma memcg: cleanup preparation for page table walk arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() mempolicy: apply page table walker on queue_pages_range() mincore: apply page table walker on do_mincore() arch/powerpc/mm/subpage-prot.c | 6 +- fs/proc/task_mmu.c | 150 ++++++++++++++++----------- include/linux/mm.h | 22 ++-- mm/huge_memory.c | 20 ---- mm/memcontrol.c | 49 +++------ mm/mempolicy.c | 224 ++++++++++++++++------------------------ mm/mincore.c | 173 +++++++++++-------------------- mm/pagewalk.c | 228 ++++++++++++++++++++++++----------------- 8 files changed, 409 insertions(+), 463 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id CD85C6B003C for ; Tue, 1 Jul 2014 13:07:48 -0400 (EDT) Received: by mail-wi0-f180.google.com with SMTP id hi2so8191977wib.7 for ; Tue, 01 Jul 2014 10:07:48 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id k3si28819935wja.3.2014.07.01.10.07.45 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:45 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 01/13] mm/pagewalk: remove pgd_entry() and pud_entry() Date: Tue, 1 Jul 2014 13:07:19 -0400 Message-Id: <1404234451-21695-2-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Currently no user of page table walker sets ->pgd_entry() or ->pud_entry(), so checking their existence in each loop is just wasting CPU cycle. So let's remove it to reduce overhead. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- include/linux/mm.h | 6 ------ mm/pagewalk.c | 9 ++------- 2 files changed, 2 insertions(+), 13 deletions(-) diff --git v3.16-rc3.orig/include/linux/mm.h v3.16-rc3/include/linux/mm.h index e03dd29145a0..c5cb6394e6cb 100644 --- v3.16-rc3.orig/include/linux/mm.h +++ v3.16-rc3/include/linux/mm.h @@ -1100,8 +1100,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, /** * mm_walk - callbacks for walk_page_range - * @pgd_entry: if set, called for each non-empty PGD (top-level) entry - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry * this handler is required to be able to handle * pmd_trans_huge() pmds. 
They may simply choose to @@ -1115,10 +1113,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * (see walk_page_range for more details) */ struct mm_walk { - int (*pgd_entry)(pgd_t *pgd, unsigned long addr, - unsigned long next, struct mm_walk *walk); - int (*pud_entry)(pud_t *pud, unsigned long addr, - unsigned long next, struct mm_walk *walk); int (*pmd_entry)(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_entry)(pte_t *pte, unsigned long addr, diff --git v3.16-rc3.orig/mm/pagewalk.c v3.16-rc3/mm/pagewalk.c index 2beeabf502c5..335690650b12 100644 --- v3.16-rc3.orig/mm/pagewalk.c +++ v3.16-rc3/mm/pagewalk.c @@ -86,9 +86,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end, break; continue; } - if (walk->pud_entry) - err = walk->pud_entry(pud, addr, next, walk); - if (!err && (walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pmd_range(pud, addr, next, walk); if (err) break; @@ -234,10 +232,7 @@ int walk_page_range(unsigned long addr, unsigned long end, pgd++; continue; } - if (walk->pgd_entry) - err = walk->pgd_entry(pgd, addr, next, walk); - if (!err && - (walk->pud_entry || walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pud_range(pgd, addr, next, walk); if (err) break; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f46.google.com (mail-wg0-f46.google.com [74.125.82.46]) by kanga.kvack.org (Postfix) with ESMTP id 0790F6B003D for ; Tue, 1 Jul 2014 13:07:48 -0400 (EDT) Received: by mail-wg0-f46.google.com with SMTP id y10so10072336wgg.29 for ; Tue, 01 Jul 2014 10:07:48 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id j18si8566710wiv.41.2014.07.01.10.07.47 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:47 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 03/13] pagewalk: add walk_page_vma() Date: Tue, 1 Jul 2014 13:07:21 -0400 Message-Id: <1404234451-21695-4-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Introduces walk_page_vma(), which is useful for the callers which want to walk over a given vma. It's used by later patches. ChangeLog v3: - check walk_page_test's return value instead of walk->skip Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- include/linux/mm.h | 1 + mm/pagewalk.c | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+) diff --git v3.16-rc3.orig/include/linux/mm.h v3.16-rc3/include/linux/mm.h index 489a63a06a4a..7e9287750866 100644 --- v3.16-rc3.orig/include/linux/mm.h +++ v3.16-rc3/include/linux/mm.h @@ -1137,6 +1137,7 @@ struct mm_walk { int walk_page_range(unsigned long addr, unsigned long end, struct mm_walk *walk); +int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk); void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, diff --git v3.16-rc3.orig/mm/pagewalk.c v3.16-rc3/mm/pagewalk.c index 91810ba875ea..65fb68df3aa2 100644 --- v3.16-rc3.orig/mm/pagewalk.c +++ v3.16-rc3/mm/pagewalk.c @@ -272,3 +272,21 @@ int walk_page_range(unsigned long start, unsigned long end, } while (start = next, start < end); return err; } + +int walk_page_vma(struct 
vm_area_struct *vma, struct mm_walk *walk) +{ + int err; + + if (!walk->mm) + return -EINVAL; + + VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem)); + VM_BUG_ON(!vma); + walk->vma = vma; + err = walk_page_test(vma->vm_start, vma->vm_end, walk); + if (err > 0) + return 0; + if (err < 0) + return err; + return __walk_page_range(vma->vm_start, vma->vm_end, walk); +} -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id 6A87D6B0044 for ; Tue, 1 Jul 2014 13:07:53 -0400 (EDT) Received: by mail-wi0-f179.google.com with SMTP id cc10so8134157wib.12 for ; Tue, 01 Jul 2014 10:07:52 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id ef3si15989806wic.104.2014.07.01.10.07.52 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:52 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 06/13] pagemap: use walk->vma instead of calling find_vma() Date: Tue, 1 Jul 2014 13:07:24 -0400 Message-Id: <1404234451-21695-7-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Page table walker has the information of the current vma in mm_walk, so we don't have to call find_vma() in each pagemap_hugetlb_range() call. NULL-vma check is omitted because we assume that we never run hugetlb_entry() callback on the address without vma. 
And even if it were broken, null pointer dereference would be detected, so we can get enough information for debugging. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index df9f368e01b7..5ebc238d1a38 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -1080,15 +1080,12 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, struct mm_walk *walk) { struct pagemapread *pm = walk->private; - struct vm_area_struct *vma; + struct vm_area_struct *vma = walk->vma; int err = 0; int flags2; pagemap_entry_t pme; - vma = find_vma(walk->mm, addr); - WARN_ON_ONCE(!vma); - - if (vma && (vma->vm_flags & VM_SOFTDIRTY)) + if (vma->vm_flags & VM_SOFTDIRTY) flags2 = __PM_SOFT_DIRTY; else flags2 = 0; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id DACEC6B004D for ; Tue, 1 Jul 2014 13:07:54 -0400 (EDT) Received: by mail-wg0-f48.google.com with SMTP id n12so9762040wgh.19 for ; Tue, 01 Jul 2014 10:07:52 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id bm6si16027366wib.40.2014.07.01.10.07.51 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:51 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 05/13] clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() Date: Tue, 1 Jul 2014 13:07:23 -0400 Message-Id: <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi clear_refs_write() has some prechecks to determine if we really walk over a given vma. Now we have a test_walk() callback to filter vmas, so let's utilize it. ChangeLog v4: - use walk_page_range instead of walk_page_vma with for loop Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. 
Shutemov --- fs/proc/task_mmu.c | 54 ++++++++++++++++++++++++++---------------------------- 1 file changed, 26 insertions(+), 28 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 3067bf08393b..df9f368e01b7 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -715,7 +715,6 @@ enum clear_refs_types { }; struct clear_refs_private { - struct vm_area_struct *vma; enum clear_refs_types type; }; @@ -748,7 +747,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct clear_refs_private *cp = walk->private; - struct vm_area_struct *vma = cp->vma; + struct vm_area_struct *vma = walk->vma; pte_t *pte, ptent; spinlock_t *ptl; struct page *page; @@ -782,6 +781,29 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, return 0; } +static int clear_refs_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct clear_refs_private *cp = walk->private; + struct vm_area_struct *vma = walk->vma; + + /* + * Writing 1 to /proc/pid/clear_refs affects all pages. + * Writing 2 to /proc/pid/clear_refs only affects anonymous pages. + * Writing 3 to /proc/pid/clear_refs only affects file mapped pages. + * Writing 4 to /proc/pid/clear_refs affects all pages. 
+ */ + if (cp->type == CLEAR_REFS_ANON && vma->vm_file) + return 1; + if (cp->type == CLEAR_REFS_MAPPED && !vma->vm_file) + return 1; + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + if (vma->vm_flags & VM_SOFTDIRTY) + vma->vm_flags &= ~VM_SOFTDIRTY; + } + return 0; +} + static ssize_t clear_refs_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { @@ -822,38 +844,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, }; struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range, + .test_walk = clear_refs_test_walk, .mm = mm, .private = &cp, }; down_read(&mm->mmap_sem); if (type == CLEAR_REFS_SOFT_DIRTY) mmu_notifier_invalidate_range_start(mm, 0, -1); - for (vma = mm->mmap; vma; vma = vma->vm_next) { - cp.vma = vma; - if (is_vm_hugetlb_page(vma)) - continue; - /* - * Writing 1 to /proc/pid/clear_refs affects all pages. - * - * Writing 2 to /proc/pid/clear_refs only affects - * Anonymous pages. - * - * Writing 3 to /proc/pid/clear_refs only affects file - * mapped pages. - * - * Writing 4 to /proc/pid/clear_refs affects all pages. - */ - if (type == CLEAR_REFS_ANON && vma->vm_file) - continue; - if (type == CLEAR_REFS_MAPPED && !vma->vm_file) - continue; - if (type == CLEAR_REFS_SOFT_DIRTY) { - if (vma->vm_flags & VM_SOFTDIRTY) - vma->vm_flags &= ~VM_SOFTDIRTY; - } - walk_page_range(vma->vm_start, vma->vm_end, - &clear_refs_walk); - } + walk_page_range(0, ~0UL, &clear_refs_walk); if (type == CLEAR_REFS_SOFT_DIRTY) mmu_notifier_invalidate_range_end(mm, 0, -1); flush_tlb_mm(mm); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yk0-f178.google.com (mail-yk0-f178.google.com [209.85.160.178]) by kanga.kvack.org (Postfix) with ESMTP id 25B6B6B0055 for ; Tue, 1 Jul 2014 13:07:55 -0400 (EDT) Received: by mail-yk0-f178.google.com with SMTP id q9so5824328ykb.23 for ; Tue, 01 Jul 2014 10:07:54 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id h13si21471550yha.163.2014.07.01.10.07.53 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:54 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 04/13] smaps: remove mem_size_stats->vma and use walk_page_vma() Date: Tue, 1 Jul 2014 13:07:22 -0400 Message-Id: <1404234451-21695-5-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable. ChangeLog v4: - remove redundant vma Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. 
Shutemov --- fs/proc/task_mmu.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index cfa63ee92c96..3067bf08393b 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -430,7 +430,6 @@ const struct file_operations proc_tid_maps_operations = { #ifdef CONFIG_PROC_PAGE_MONITOR struct mem_size_stats { - struct vm_area_struct *vma; unsigned long resident; unsigned long shared_clean; unsigned long shared_dirty; @@ -449,7 +448,7 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr, unsigned long ptent_size, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - struct vm_area_struct *vma = mss->vma; + struct vm_area_struct *vma = walk->vma; pgoff_t pgoff = linear_page_index(vma, addr); struct page *page = NULL; int mapcount; @@ -501,7 +500,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - struct vm_area_struct *vma = mss->vma; + struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; @@ -594,10 +593,8 @@ static int show_smap(struct seq_file *m, void *v, int is_pid) }; memset(&mss, 0, sizeof mss); - mss.vma = vma; /* mmap_sem is held in m_start */ - if (vma->vm_mm && !is_vm_hugetlb_page(vma)) - walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); + walk_page_vma(vma, &smaps_walk); show_map_vma(m, vma, is_pid); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f43.google.com (mail-qg0-f43.google.com [209.85.192.43]) by kanga.kvack.org (Postfix) with ESMTP id 762A46B004D for ; Tue, 1 Jul 2014 13:07:55 -0400 (EDT) Received: by mail-qg0-f43.google.com with SMTP id z60so3606701qgd.2 for ; Tue, 01 Jul 2014 10:07:55 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id a1si29989978qas.126.2014.07.01.10.07.53 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:54 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 09/13] memcg: cleanup preparation for page table walk Date: Tue, 1 Jul 2014 13:07:27 -0400 Message-Id: <1404234451-21695-10-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi pagewalk.c can handle the vma by itself, so we don't have to pass the vma via walk->private. Both mem_cgroup_count_precharge() and mem_cgroup_move_charge() loop over each vma themselves, but that is now done in pagewalk.c, so let's clean them up. ChangeLog v4: - use walk_page_range() instead of walk_page_vma() with a for loop. 
Signed-off-by: Naoya Horiguchi --- mm/memcontrol.c | 49 ++++++++++++++++--------------------------------- 1 file changed, 16 insertions(+), 33 deletions(-) diff --git v3.16-rc3.orig/mm/memcontrol.c v3.16-rc3/mm/memcontrol.c index a2c7bcb0e6eb..6c075113c363 100644 --- v3.16-rc3.orig/mm/memcontrol.c +++ v3.16-rc3/mm/memcontrol.c @@ -6654,7 +6654,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct vm_area_struct *vma = walk->private; + struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; @@ -6680,20 +6680,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm) { unsigned long precharge; - struct vm_area_struct *vma; + struct mm_walk mem_cgroup_count_precharge_walk = { + .pmd_entry = mem_cgroup_count_precharge_pte_range, + .mm = mm, + }; down_read(&mm->mmap_sem); - for (vma = mm->mmap; vma; vma = vma->vm_next) { - struct mm_walk mem_cgroup_count_precharge_walk = { - .pmd_entry = mem_cgroup_count_precharge_pte_range, - .mm = mm, - .private = vma, - }; - if (is_vm_hugetlb_page(vma)) - continue; - walk_page_range(vma->vm_start, vma->vm_end, - &mem_cgroup_count_precharge_walk); - } + walk_page_range(0, ~0UL, &mem_cgroup_count_precharge_walk); up_read(&mm->mmap_sem); precharge = mc.precharge; @@ -6832,7 +6825,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, struct mm_walk *walk) { int ret = 0; - struct vm_area_struct *vma = walk->private; + struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; enum mc_target_type target_type; @@ -6932,7 +6925,10 @@ put: /* get_mctgt_type() gets the page */ static void mem_cgroup_move_charge(struct mm_struct *mm) { - struct vm_area_struct *vma; + struct mm_walk mem_cgroup_move_charge_walk = { + .pmd_entry = mem_cgroup_move_charge_pte_range, + .mm = mm, + }; lru_add_drain_all(); retry: @@ -6948,24 +6944,11 @@ static void 
mem_cgroup_move_charge(struct mm_struct *mm) cond_resched(); goto retry; } - for (vma = mm->mmap; vma; vma = vma->vm_next) { - int ret; - struct mm_walk mem_cgroup_move_charge_walk = { - .pmd_entry = mem_cgroup_move_charge_pte_range, - .mm = mm, - .private = vma, - }; - if (is_vm_hugetlb_page(vma)) - continue; - ret = walk_page_range(vma->vm_start, vma->vm_end, - &mem_cgroup_move_charge_walk); - if (ret) - /* - * means we have consumed all precharges and failed in - * doing additional charge. Just abandon here. - */ - break; - } + /* + * When we have consumed all precharges and failed in doing + * additional charge, the page walk just aborts. + */ + walk_page_range(0, ~0UL, &mem_cgroup_move_charge_walk); up_read(&mm->mmap_sem); } -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f171.google.com (mail-we0-f171.google.com [74.125.82.171]) by kanga.kvack.org (Postfix) with ESMTP id 98FB16B0055 for ; Tue, 1 Jul 2014 13:07:55 -0400 (EDT) Received: by mail-we0-f171.google.com with SMTP id q58so9988620wes.16 for ; Tue, 01 Jul 2014 10:07:54 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id m4si16026398wiy.39.2014.07.01.10.07.53 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:54 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 08/13] numa_maps: remove numa_maps->vma Date: Tue, 1 Jul 2014 13:07:26 -0400 Message-Id: <1404234451-21695-9-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. 
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_numa_map() walks pages on vma basis, so using walk_page_vma() is preferable. ChangeLog v4: - remove redundant vma Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 0d3d1ac32b2e..4ca28f401bb1 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -1245,7 +1245,6 @@ const struct file_operations proc_pagemap_operations = { #ifdef CONFIG_NUMA struct numa_maps { - struct vm_area_struct *vma; unsigned long pages; unsigned long anon; unsigned long active; @@ -1314,18 +1313,17 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, static int gather_pte_stats(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct numa_maps *md; + struct numa_maps *md = walk->private; + struct vm_area_struct *vma = walk->vma; spinlock_t *ptl; pte_t *orig_pte; pte_t *pte; - md = walk->private; - - if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) { + if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pte_t huge_pte = *(pte_t *)pmd; struct page *page; - page = can_gather_numa_stats(huge_pte, md->vma, addr); + page = can_gather_numa_stats(huge_pte, vma, addr); if (page) gather_stats(page, md, pte_dirty(huge_pte), HPAGE_PMD_SIZE/PAGE_SIZE); @@ -1337,7 +1335,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, return 0; orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); do { - struct page *page = can_gather_numa_stats(*pte, md->vma, addr); + struct page *page = can_gather_numa_stats(*pte, vma, addr); if (!page) continue; gather_stats(page, md, pte_dirty(*pte), 1); @@ -1385,7 +1383,12 @@ static int show_numa_map(struct seq_file *m, 
void *v, int is_pid) struct file *file = vma->vm_file; struct task_struct *task = proc_priv->task; struct mm_struct *mm = vma->vm_mm; - struct mm_walk walk = {}; + struct mm_walk walk = { + .hugetlb_entry = gather_hugetlb_stats, + .pmd_entry = gather_pte_stats, + .private = md, + .mm = mm, + }; struct mempolicy *pol; char buffer[64]; int nid; @@ -1396,13 +1399,6 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) /* Ensure we start with an empty set of numa_maps statistics. */ memset(md, 0, sizeof(*md)); - md->vma = vma; - - walk.hugetlb_entry = gather_hugetlb_stats; - walk.pmd_entry = gather_pte_stats; - walk.private = md; - walk.mm = mm; - pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol); mpol_cond_put(pol); @@ -1432,7 +1428,8 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) if (is_vm_hugetlb_page(vma)) seq_puts(m, " huge"); - walk_page_range(vma->vm_start, vma->vm_end, &walk); + /* mmap_sem is held by m_start */ + walk_page_vma(vma, &walk); if (!md->pages) goto out; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 565C86B0068 for ; Tue, 1 Jul 2014 13:07:56 -0400 (EDT) Received: by mail-wi0-f181.google.com with SMTP id n3so8197820wiv.14 for ; Tue, 01 Jul 2014 10:07:55 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id u4si28715476wjy.175.2014.07.01.10.07.54 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:55 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 11/13] mempolicy: apply page table walker on queue_pages_range() Date: Tue, 1 Jul 2014 13:07:29 -0400 Message-Id: <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi queue_pages_range() currently walks page tables in its own way, which duplicates code. This patch applies the page table walker to reduce the lines of code. queue_pages_range() has to do some prechecks to determine whether we really walk over the vma or just skip it. Now we have the test_walk() callback in mm_walk for this purpose, so we can do this replacement cleanly. queue_pages_test_walk() depends not only on the current vma but also on the previous one, so queue_pages->prev is introduced to remember it. ChangeLog v4: - rebase to v3.16-rc3, where the return value of queue_pages_range() becomes 0 on success instead of the first found vma, and use -EFAULT instead of ERR_PTR() on failure. 
Signed-off-by: Naoya Horiguchi --- mm/mempolicy.c | 224 +++++++++++++++++++++++---------------------------------- 1 file changed, 90 insertions(+), 134 deletions(-) diff --git v3.16-rc3.orig/mm/mempolicy.c v3.16-rc3/mm/mempolicy.c index eb58de19f815..47a6fa913b83 100644 --- v3.16-rc3.orig/mm/mempolicy.c +++ v3.16-rc3/mm/mempolicy.c @@ -479,24 +479,34 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { static void migrate_page_add(struct page *page, struct list_head *pagelist, unsigned long flags); +struct queue_pages { + struct list_head *pagelist; + unsigned long flags; + nodemask_t *nmask; + struct vm_area_struct *prev; +}; + /* * Scan through pages checking if pages follow certain conditions, * and move them to the pagelist if they do. */ -static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) +static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) { - pte_t *orig_pte; + struct vm_area_struct *vma = walk->vma; + struct page *page; + struct queue_pages *qp = walk->private; + unsigned long flags = qp->flags; + int nid; pte_t *pte; spinlock_t *ptl; - orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - do { - struct page *page; - int nid; + split_huge_page_pmd(vma, addr, pmd); + if (pmd_trans_unstable(pmd)) + return 0; + pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + for (; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; page = vm_normal_page(vma, addr, *pte); @@ -509,114 +519,46 @@ static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (PageReserved(page)) continue; nid = page_to_nid(page); - if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) + if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT)) continue; if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) - migrate_page_add(page, private, 
flags); - else - break; - } while (pte++, addr += PAGE_SIZE, addr != end); - pte_unmap_unlock(orig_pte, ptl); - return addr != end; + migrate_page_add(page, qp->pagelist, flags); + } + pte_unmap_unlock(pte - 1, ptl); + cond_resched(); + return 0; } -static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma, - pmd_t *pmd, const nodemask_t *nodes, unsigned long flags, - void *private) +static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long end, + struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE + struct queue_pages *qp = walk->private; + unsigned long flags = qp->flags; int nid; struct page *page; spinlock_t *ptl; pte_t entry; - ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd); - entry = huge_ptep_get((pte_t *)pmd); + ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte); + entry = huge_ptep_get(pte); if (!pte_present(entry)) goto unlock; page = pte_page(entry); nid = page_to_nid(page); - if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) + if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT)) goto unlock; /* With MPOL_MF_MOVE, we migrate only unshared hugepage. 
*/ if (flags & (MPOL_MF_MOVE_ALL) || (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) - isolate_huge_page(page, private); + isolate_huge_page(page, qp->pagelist); unlock: spin_unlock(ptl); #else BUG(); #endif -} - -static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) -{ - pmd_t *pmd; - unsigned long next; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (!pmd_present(*pmd)) - continue; - if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) { - queue_pages_hugetlb_pmd_range(vma, pmd, nodes, - flags, private); - continue; - } - split_huge_page_pmd(vma, addr, pmd); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - continue; - if (queue_pages_pte_range(vma, pmd, addr, next, nodes, - flags, private)) - return -EIO; - } while (pmd++, addr = next, addr != end); - return 0; -} - -static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) -{ - pud_t *pud; - unsigned long next; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) - continue; - if (pud_none_or_clear_bad(pud)) - continue; - if (queue_pages_pmd_range(vma, pud, addr, next, nodes, - flags, private)) - return -EIO; - } while (pud++, addr = next, addr != end); - return 0; -} - -static inline int queue_pages_pgd_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) -{ - pgd_t *pgd; - unsigned long next; - - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - continue; - if (queue_pages_pud_range(vma, pgd, addr, next, nodes, - flags, private)) - return -EIO; - } while (pgd++, addr = next, addr != end); return 0; } @@ 
-649,6 +591,44 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma, } #endif /* CONFIG_NUMA_BALANCING */ +static int queue_pages_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + struct queue_pages *qp = walk->private; + unsigned long endvma = vma->vm_end; + unsigned long flags = qp->flags; + + if (endvma > end) + endvma = end; + if (vma->vm_start > start) + start = vma->vm_start; + + if (!(flags & MPOL_MF_DISCONTIG_OK)) { + if (!vma->vm_next && vma->vm_end < end) + return -EFAULT; + if (qp->prev && qp->prev->vm_end < vma->vm_start) + return -EFAULT; + } + + qp->prev = vma; + + if (vma->vm_flags & VM_PFNMAP) + return 1; + + if (flags & MPOL_MF_LAZY) { + change_prot_numa(vma, start, endvma); + return 1; + } + + if ((flags & MPOL_MF_STRICT) || + ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && + vma_migratable(vma))) + /* queue pages from current vma */ + return 0; + return 1; +} + /* * Walk through page tables and collect pages to be migrated. 
* @@ -658,48 +638,24 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma, */ static int queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, - const nodemask_t *nodes, unsigned long flags, void *private) -{ - int err = 0; - struct vm_area_struct *vma, *prev; - - vma = find_vma(mm, start); - if (!vma) - return -EFAULT; - prev = NULL; - for (; vma && vma->vm_start < end; vma = vma->vm_next) { - unsigned long endvma = vma->vm_end; - - if (endvma > end) - endvma = end; - if (vma->vm_start > start) - start = vma->vm_start; - - if (!(flags & MPOL_MF_DISCONTIG_OK)) { - if (!vma->vm_next && vma->vm_end < end) - return -EFAULT; - if (prev && prev->vm_end < vma->vm_start) - return -EFAULT; - } - - if (flags & MPOL_MF_LAZY) { - change_prot_numa(vma, start, endvma); - goto next; - } - - if ((flags & MPOL_MF_STRICT) || - ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && - vma_migratable(vma))) { - - err = queue_pages_pgd_range(vma, start, endvma, nodes, - flags, private); - if (err) - break; - } -next: - prev = vma; - } - return err; + nodemask_t *nodes, unsigned long flags, + struct list_head *pagelist) +{ + struct queue_pages qp = { + .pagelist = pagelist, + .flags = flags, + .nmask = nodes, + .prev = NULL, + }; + struct mm_walk queue_pages_walk = { + .hugetlb_entry = queue_pages_hugetlb, + .pmd_entry = queue_pages_pte_range, + .test_walk = queue_pages_test_walk, + .mm = mm, + .private = &qp, + }; + + return walk_page_range(start, end, &queue_pages_walk); } /* -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
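For readers following the mempolicy conversion above, the two decisions that moved into callbacks can be modelled in plain userspace C. This is only a sketch under stated assumptions: the TOY_MF_* values, the bitmask standing in for nodemask_t, and both helper names are hypothetical and are not kernel API. It mirrors the node filter in queue_pages_pte_range() and the return convention of queue_pages_test_walk() (0 means walk the vma, 1 means skip it).

```c
#include <stdbool.h>

/*
 * Hypothetical flag values for illustration only; the real MPOL_MF_*
 * constants are defined by the kernel headers.
 */
enum {
	TOY_MF_STRICT   = 1 << 0,
	TOY_MF_MOVE     = 1 << 1,
	TOY_MF_MOVE_ALL = 1 << 2,
	TOY_MF_LAZY     = 1 << 3,
	TOY_MF_INVERT   = 1 << 4,
};

/*
 * Models the test
 *   node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT)
 * A true result corresponds to the "continue" (page not queued) branch.
 * nmask is a plain bitmask here, so nid must be smaller than the number
 * of bits in unsigned long.
 */
static bool toy_page_is_skipped(int nid, unsigned long nmask, unsigned flags)
{
	bool in_mask = (nmask >> nid) & 1UL;
	bool invert = !!(flags & TOY_MF_INVERT);

	return in_mask == invert;
}

/*
 * Models the tail of queue_pages_test_walk(): 0 = walk this vma,
 * 1 = skip it. The contiguity checks (-EFAULT) are left out, and the
 * LAZY branch only records the skip; the kernel code calls
 * change_prot_numa() before returning.
 */
static int toy_test_walk(bool pfnmap, bool migratable, unsigned flags)
{
	if (pfnmap)
		return 1;
	if (flags & TOY_MF_LAZY)
		return 1;
	if ((flags & TOY_MF_STRICT) ||
	    ((flags & (TOY_MF_MOVE | TOY_MF_MOVE_ALL)) && migratable))
		return 0;
	return 1;
}
```

With a mask that contains only node 1, a page on node 1 is kept and a page on node 0 is skipped; setting the invert flag swaps the two outcomes.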
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id BD9926B0068 for ; Tue, 1 Jul 2014 13:07:58 -0400 (EDT) Received: by mail-wi0-f178.google.com with SMTP id n15so8164976wiw.11 for ; Tue, 01 Jul 2014 10:07:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id fv8si16005044wib.73.2014.07.01.10.07.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:57 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 10/13] arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() Date: Tue, 1 Jul 2014 13:07:28 -0400 Message-Id: <1404234451-21695-11-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Now that mm_walk->vma is available, we no longer have to pass the vma to the callback function through mm_walk->private. In addition, walk_page_vma() is the simpler interface when we walk over a single vma. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A.
Shutemov --- arch/powerpc/mm/subpage-prot.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git v3.16-rc3.orig/arch/powerpc/mm/subpage-prot.c v3.16-rc3/arch/powerpc/mm/subpage-prot.c index 6c0b1f5f8d2c..fa9fb5b4c66c 100644 --- v3.16-rc3.orig/arch/powerpc/mm/subpage-prot.c +++ v3.16-rc3/arch/powerpc/mm/subpage-prot.c @@ -134,7 +134,7 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len) static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct vm_area_struct *vma = walk->private; + struct vm_area_struct *vma = walk->vma; split_huge_page_pmd(vma, addr, pmd); return 0; } @@ -163,9 +163,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, if (vma->vm_start >= (addr + len)) break; vma->vm_flags |= VM_NOHUGEPAGE; - subpage_proto_walk.private = vma; - walk_page_range(vma->vm_start, vma->vm_end, - &subpage_proto_walk); + walk_page_vma(vma, &subpage_proto_walk); vma = vma->vm_next; } } -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f170.google.com (mail-we0-f170.google.com [74.125.82.170]) by kanga.kvack.org (Postfix) with ESMTP id E82936B0069 for ; Tue, 1 Jul 2014 13:07:58 -0400 (EDT) Received: by mail-we0-f170.google.com with SMTP id w61so9886908wes.1 for ; Tue, 01 Jul 2014 10:07:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id cu9si16006695wib.74.2014.07.01.10.07.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:51 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 02/13] pagewalk: improve vma handling Date: Tue, 1 Jul 2014 13:07:20 -0400 Message-Id: <1404234451-21695-3-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi The current implementation of the page table walker has a fundamental problem in vma handling, which started when we tried to handle vma(VM_HUGETLB). Because the vma handling is done inside the pgd loop, taking vma boundaries into account makes the code complicated and bug-prone. From the users' viewpoint, some callers check a vma-related condition to decide whether they really want to walk over the vma. To solve these problems, this patch moves the vma check outside the pgd loop and introduces a new callback, ->test_walk(). ChangeLog v4: - avoid walking over the regions where vma is NULL if pte_hole() is undefined - use vma->vm_next instead of repeating find_vma() - use min() in walk_page_range "outside vma" branch - fix return value of walk_hugetlb_range() ChangeLog v3: - drop walk->skip control Signed-off-by: Naoya Horiguchi Acked-by: Kirill A.
Shutemov --- include/linux/mm.h | 15 +++- mm/pagewalk.c | 203 ++++++++++++++++++++++++++++++----------------------- 2 files changed, 129 insertions(+), 89 deletions(-) diff --git v3.16-rc3.orig/include/linux/mm.h v3.16-rc3/include/linux/mm.h index c5cb6394e6cb..489a63a06a4a 100644 --- v3.16-rc3.orig/include/linux/mm.h +++ v3.16-rc3/include/linux/mm.h @@ -1107,10 +1107,16 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @pte_entry: if set, called for each non-empty PTE (4th-level) entry * @pte_hole: if set, called for each hole at all levels * @hugetlb_entry: if set, called for each hugetlb entry - * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry - * is used. + * @test_walk: caller specific callback function to determine whether + * we walk over the current vma or not. A returned value of 0 + * means "do page table walk over the current vma," a negative + * one means "abort current page table walk right now," and a + * positive one means "skip the current vma."
+ * @mm: mm_struct representing the target process of page table walk + * @vma: vma currently walked (NULL if walking outside vmas) + * @private: private data for callbacks' usage * - * (see walk_page_range for more details) + * (see the comment on walk_page_range() for more details) */ struct mm_walk { int (*pmd_entry)(pmd_t *pmd, unsigned long addr, @@ -1122,7 +1128,10 @@ struct mm_walk { int (*hugetlb_entry)(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long next, struct mm_walk *walk); + int (*test_walk)(unsigned long addr, unsigned long next, + struct mm_walk *walk); struct mm_struct *mm; + struct vm_area_struct *vma; void *private; }; diff --git v3.16-rc3.orig/mm/pagewalk.c v3.16-rc3/mm/pagewalk.c index 335690650b12..91810ba875ea 100644 --- v3.16-rc3.orig/mm/pagewalk.c +++ v3.16-rc3/mm/pagewalk.c @@ -59,7 +59,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, continue; split_huge_page_pmd_mm(walk->mm, addr, pmd); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_trans_unstable(pmd)) goto again; err = walk_pte_range(pmd, addr, next, walk); if (err) @@ -95,6 +95,32 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end, return err; } +static int walk_pgd_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + pgd_t *pgd; + unsigned long next; + int err = 0; + + pgd = pgd_offset(walk->mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) { + if (walk->pte_hole) + err = walk->pte_hole(addr, next, walk); + if (err) + break; + continue; + } + if (walk->pmd_entry || walk->pte_entry) + err = walk_pud_range(pgd, addr, next, walk); + if (err) + break; + } while (pgd++, addr = next, addr != end); + + return err; +} + #ifdef CONFIG_HUGETLB_PAGE static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, unsigned long end) @@ -103,10 +129,10 @@ static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, return 
boundary < end ? boundary : end; } -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; struct hstate *h = hstate_vma(vma); unsigned long next; unsigned long hmask = huge_page_mask(h); @@ -119,15 +145,14 @@ static int walk_hugetlb_range(struct vm_area_struct *vma, if (pte && walk->hugetlb_entry) err = walk->hugetlb_entry(pte, hmask, addr, next, walk); if (err) - return err; + break; } while (addr = next, addr != end); - return 0; + return err; } #else /* CONFIG_HUGETLB_PAGE */ -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; @@ -135,109 +160,115 @@ static int walk_hugetlb_range(struct vm_area_struct *vma, #endif /* CONFIG_HUGETLB_PAGE */ +/* + * Decide whether we really walk over the current vma on [@start, @end) + * or skip it via the returned value. Return 0 if we do walk over the + * current vma, and return 1 if we skip the vma. Negative values means + * error, where we abort the current walk. + * + * Default check (only VM_PFNMAP check for now) is used when the caller + * doesn't define test_walk() callback. + */ +static int walk_page_test(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + if (walk->test_walk) + return walk->test_walk(start, end, walk); + + /* + * Do not walk over vma(VM_PFNMAP), because we have no valid struct + * page backing a VM_PFNMAP range. See also commit a9ff785e4437. 
+ */ + if (vma->vm_flags & VM_PFNMAP) + return 1; + return 0; +} + +static int __walk_page_range(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int err = 0; + struct vm_area_struct *vma = walk->vma; + + if (vma && is_vm_hugetlb_page(vma)) { + if (walk->hugetlb_entry) + err = walk_hugetlb_range(start, end, walk); + } else + err = walk_pgd_range(start, end, walk); + + return err; +} /** - * walk_page_range - walk a memory map's page tables with a callback - * @addr: starting address - * @end: ending address - * @walk: set of callbacks to invoke for each level of the tree - * - * Recursively walk the page table for the memory area in a VMA, - * calling supplied callbacks. Callbacks are called in-order (first - * PGD, first PUD, first PMD, first PTE, second PTE... second PMD, - * etc.). If lower-level callbacks are omitted, walking depth is reduced. + * walk_page_range - walk page table with caller specific callbacks * - * Each callback receives an entry pointer and the start and end of the - * associated range, and a copy of the original mm_walk for access to - * the ->private or ->mm fields. + * Recursively walk the page table tree of the process represented by @walk->mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. + * The return values of these callbacks are commonly defined like below: + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. * - * Usually no locks are taken, but splitting transparent huge page may - * take page table lock. 
And the bottom level iterator will map PTE - * directories from highmem if necessary. + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @walk->test_walk() are used for this + * purpose. * - * If any callback returns a non-zero value, the walk is aborted and - * the return value is propagated back to the caller. Otherwise 0 is returned. + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @walk->private should be helpful. * - * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry - * is !NULL. + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold + * @walk->mm->mmap_sem, because these function traverse vma list and/or + * access to vma's data. */ -int walk_page_range(unsigned long addr, unsigned long end, +int walk_page_range(unsigned long start, unsigned long end, struct mm_walk *walk) { - pgd_t *pgd; - unsigned long next; int err = 0; + unsigned long next; + struct vm_area_struct *vma; - if (addr >= end) - return err; + if (start >= end) + return -EINVAL; if (!walk->mm) return -EINVAL; VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem)); - pgd = pgd_offset(walk->mm, addr); + vma = find_vma(walk->mm, start); do { - struct vm_area_struct *vma = NULL; + if (!vma) { /* after the last vma */ + walk->vma = NULL; + next = end; + } else if (start < vma->vm_start) { /* outside vma */ + walk->vma = NULL; + next = min(end, vma->vm_start); + } else { /* inside vma */ + walk->vma = vma; + next = min(end, vma->vm_end); + vma = vma->vm_next; - next = pgd_addr_end(addr, end); - - /* - * This function was not intended to be vma based. 
- * But there are vma special cases to be handled: - * - hugetlb vma's - * - VM_PFNMAP vma's - */ - vma = find_vma(walk->mm, addr); - if (vma) { - /* - * There are no page structures backing a VM_PFNMAP - * range, so do not allow split_huge_page_pmd(). - */ - if ((vma->vm_start <= addr) && - (vma->vm_flags & VM_PFNMAP)) { - next = vma->vm_end; - pgd = pgd_offset(walk->mm, next); + err = walk_page_test(start, next, walk); + if (err > 0) continue; - } - /* - * Handle hugetlb vma individually because pagetable - * walk for the hugetlb page is dependent on the - * architecture and we can't handled it in the same - * manner as non-huge pages. - */ - if (walk->hugetlb_entry && (vma->vm_start <= addr) && - is_vm_hugetlb_page(vma)) { - if (vma->vm_end < next) - next = vma->vm_end; - /* - * Hugepage is very tightly coupled with vma, - * so walk through hugetlb entries within a - * given vma. - */ - err = walk_hugetlb_range(vma, addr, next, walk); - if (err) - break; - pgd = pgd_offset(walk->mm, next); - continue; - } - } - - if (pgd_none_or_clear_bad(pgd)) { - if (walk->pte_hole) - err = walk->pte_hole(addr, next, walk); - if (err) + if (err < 0) break; - pgd++; - continue; } - if (walk->pmd_entry || walk->pte_entry) - err = walk_pud_range(pgd, addr, next, walk); + if (walk->vma || walk->pte_hole) + err = __walk_page_range(start, next, walk); if (err) break; - pgd++; - } while (addr = next, addr < end); - + } while (start = next, start < end); return err; } -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
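The control flow of the reworked walk_page_range() above can be tried out in userspace with a small model. This is a sketch under stated assumptions, not the kernel implementation: the toy_* types and helpers are hypothetical stand-ins, the page table descent is reduced to counting bytes, and the has_hole_cb flag stands in for a non-NULL walk->pte_hole. One deliberate deviation is labelled in the code: the sketch clears a positive test result before continuing, so that a skip of the last vma is not returned to the caller.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	int pfnmap;			/* stands in for VM_PFNMAP */
	struct toy_vma *next;		/* address-ordered list */
};

struct toy_walk {
	struct toy_vma *vma;		/* current vma, NULL in a hole */
	int (*test_walk)(unsigned long start, unsigned long end,
			 struct toy_walk *walk);
	int has_hole_cb;		/* models walk->pte_hole != NULL */
	unsigned long walked;		/* bytes handed to the lower walk */
};

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* First vma that ends above addr, like find_vma(). */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
	while (head && head->end <= addr)
		head = head->next;
	return head;
}

/* 0: walk this vma, 1: skip it, negative: abort the whole walk. */
static int toy_walk_page_test(unsigned long start, unsigned long end,
			      struct toy_walk *walk)
{
	if (walk->test_walk)
		return walk->test_walk(start, end, walk);
	return walk->vma->pfnmap ? 1 : 0;
}

static int toy_walk_page_range(struct toy_vma *head, unsigned long start,
			       unsigned long end, struct toy_walk *walk)
{
	struct toy_vma *vma;
	unsigned long next;
	int err = 0;

	if (start >= end)
		return -1;

	vma = toy_find_vma(head, start);
	do {
		if (!vma) {				/* after the last vma */
			walk->vma = NULL;
			next = end;
		} else if (start < vma->start) {	/* outside vma */
			walk->vma = NULL;
			next = toy_min(end, vma->start);
		} else {				/* inside vma */
			walk->vma = vma;
			next = toy_min(end, vma->end);
			vma = vma->next;

			err = toy_walk_page_test(start, next, walk);
			if (err > 0) {
				/* Sketch-only choice: do not leak "skip". */
				err = 0;
				continue;
			}
			if (err < 0)
				break;
		}
		/* Holes are visited only when a hole callback exists. */
		if (walk->vma || walk->has_hole_cb)
			walk->walked += next - start;
	} while (start = next, start < end);

	return err;
}
```

Given three vmas with gaps between them and the middle one marked pfnmap, only the first and last vmas contribute to walked, unless has_hole_cb is set, in which case the gaps are counted as well.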
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f42.google.com (mail-wg0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id 680446B006C for ; Tue, 1 Jul 2014 13:07:59 -0400 (EDT) Received: by mail-wg0-f42.google.com with SMTP id z12so9850831wgg.13 for ; Tue, 01 Jul 2014 10:07:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id z20si16038397wij.18.2014.07.01.10.07.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:57 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 07/13] numa_maps: fix typo in gather_hugetbl_stats Date: Tue, 1 Jul 2014 13:07:25 -0400 Message-Id: <1404234451-21695-8-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code grep-friendly. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. 
Shutemov --- fs/proc/task_mmu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 5ebc238d1a38..0d3d1ac32b2e 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -1347,7 +1347,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, return 0; } #ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, +static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct numa_maps *md; @@ -1366,7 +1366,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, } #else -static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, +static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; @@ -1398,7 +1398,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) md->vma = vma; - walk.hugetlb_entry = gather_hugetbl_stats; + walk.hugetlb_entry = gather_hugetlb_stats; walk.pmd_entry = gather_pte_stats; walk.private = md; walk.mm = mm; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id E61B56B0070 for ; Tue, 1 Jul 2014 13:07:59 -0400 (EDT) Received: by mail-wi0-f178.google.com with SMTP id n15so8148774wiw.17 for ; Tue, 01 Jul 2014 10:07:59 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id gt4si16007800wib.64.2014.07.01.10.07.58 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:07:59 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 12/13] mm: /proc/pid/clear_refs: avoid split_huge_page() Date: Tue, 1 Jul 2014 13:07:30 -0400 Message-Id: <1404234451-21695-13-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi , "Kirill A. Shutemov" , Pavel Emelyanov , Andrea Arcangeli , Cyrill Gorcunov From: "Kirill A. Shutemov" Currently pagewalker splits all THP pages on any clear_refs request. It's not necessary. We can handle this on PMD level. One side effect is that soft dirty will potentially see more dirty memory, since we will mark whole THP page dirty at once. Sanity checked with CRIU test suite. More testing is required. ChangeLog: - move code for thp to clear_refs_pte_range() Signed-off-by: Kirill A. 
Shutemov Cc: Pavel Emelyanov Cc: Andrea Arcangeli Cc: Dave Hansen Signed-off-by: Naoya Horiguchi Cc: Cyrill Gorcunov Signed-off-by: Andrew Morton --- fs/proc/task_mmu.c | 47 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 44 insertions(+), 3 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 4ca28f401bb1..50518282ca2e 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -718,10 +718,10 @@ struct clear_refs_private { enum clear_refs_types type; }; +#ifdef CONFIG_MEM_SOFT_DIRTY static inline void clear_soft_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *pte) { -#ifdef CONFIG_MEM_SOFT_DIRTY /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. See the @@ -740,9 +740,35 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } set_pte_at(vma->vm_mm, addr, pte, ptent); -#endif } +static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ + pmd_t pmd = *pmdp; + + pmd = pmd_wrprotect(pmd); + pmd = pmd_clear_flags(pmd, _PAGE_SOFT_DIRTY); + + if (vma->vm_flags & VM_SOFTDIRTY) + vma->vm_flags &= ~VM_SOFTDIRTY; + + set_pmd_at(vma->vm_mm, addr, pmdp, pmd); +} + +#else + +static inline void clear_soft_dirty(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte) +{ +} + +static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ +} +#endif + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -752,7 +778,22 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, spinlock_t *ptl; struct page *page; - split_huge_page_pmd(vma, addr, pmd); + if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + clear_soft_dirty_pmd(vma, addr, pmd); + goto out; + } + + page = pmd_page(*pmd); + + /* Clear accessed and referenced bits. 
*/ + pmdp_test_and_clear_young(vma, addr, pmd); + ClearPageReferenced(page); +out: + spin_unlock(ptl); + return 0; + } + if (pmd_trans_unstable(pmd)) return 0; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id C70736B0070 for ; Tue, 1 Jul 2014 13:08:05 -0400 (EDT) Received: by mail-wi0-f181.google.com with SMTP id n3so8134494wiv.8 for ; Tue, 01 Jul 2014 10:08:05 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id vn7si28795790wjc.45.2014.07.01.10.08.04 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 10:08:05 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Date: Tue, 1 Jul 2014 13:07:31 -0400 Message-Id: <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi This patch makes do_mincore() use walk_page_vma(), which reduces many lines of code by using common page table walk code. 
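Since the mincore callbacks share one result vector through walk->private, each callback has to advance that pointer by exactly the number of pages it filled. A minimal userspace sketch of that cursor pattern follows; toy_cursor_walk, toy_fill_range and TOY_PAGE_SHIFT are hypothetical names chosen for illustration and do not appear in the patch.

```c
#include <string.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages for the example */

/* Hypothetical stand-in for the single field of struct mm_walk used here. */
struct toy_cursor_walk {
	void *private;		/* next free byte of the result vector */
};

/*
 * Write one status byte per page of [addr, end) and advance the shared
 * cursor. The page count is computed up front, before anything changes,
 * so the cursor always moves by the number of bytes written.
 */
static int toy_fill_range(unsigned long addr, unsigned long end,
			  unsigned char present,
			  struct toy_cursor_walk *walk)
{
	unsigned char *vec = walk->private;
	unsigned long nr = (end - addr) >> TOY_PAGE_SHIFT;

	memset(vec, present, nr);
	walk->private = vec + nr;
	return 0;
}
```

Calling the helper for two adjacent ranges fills consecutive bytes of the vector, which is the property the series relies on when walk_page_vma() invokes the callbacks in address order.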
ChangeLog v4: - remove redundant vma ChangeLog v3: - add NULL vma check in mincore_unmapped_range() - don't use pte_entry() ChangeLog v2: - change type of args of callbacks to void * - move definition of mincore_walk to the start of the function to fix compiler warning Signed-off-by: Naoya Horiguchi --- mm/huge_memory.c | 20 ------- mm/mincore.c | 173 ++++++++++++++++++++----------------------------------- 2 files changed, 62 insertions(+), 131 deletions(-) diff --git v3.16-rc3.orig/mm/huge_memory.c v3.16-rc3/mm/huge_memory.c index 33514d88fef9..63bed13c6cf5 100644 --- v3.16-rc3.orig/mm/huge_memory.c +++ v3.16-rc3/mm/huge_memory.c @@ -1410,26 +1410,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } -int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - spinlock_t *ptl; - int ret = 0; - - if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - /* - * All logical pages in the range are present - * if backed by a huge page. 
- */ - spin_unlock(ptl); - memset(vec, 1, (end - addr) >> PAGE_SHIFT); - ret = 1; - } - - return ret; -} - int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, unsigned long old_addr, unsigned long new_addr, unsigned long old_end, diff --git v3.16-rc3.orig/mm/mincore.c v3.16-rc3/mm/mincore.c index 725c80961048..3c64dcbcb3e2 100644 --- v3.16-rc3.orig/mm/mincore.c +++ v3.16-rc3/mm/mincore.c @@ -19,38 +19,26 @@ #include #include -static void mincore_hugetlb_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, + unsigned long end, struct mm_walk *walk) { + int err = 0; #ifdef CONFIG_HUGETLB_PAGE - struct hstate *h; + unsigned char present; + unsigned char *vec = walk->private; - h = hstate_vma(vma); - while (1) { - unsigned char present; - pte_t *ptep; - /* - * Huge pages are always in RAM for now, but - * theoretically it needs to be checked. - */ - ptep = huge_pte_offset(current->mm, - addr & huge_page_mask(h)); - present = ptep && !huge_pte_none(huge_ptep_get(ptep)); - while (1) { - *vec = present; - vec++; - addr += PAGE_SIZE; - if (addr == end) - return; - /* check hugepage border */ - if (!(addr & ~huge_page_mask(h))) - break; - } - } + /* + * Hugepages under user process are always in RAM and never + * swapped out, but theoretically it needs to be checked. 
+ */ + present = pte && !huge_pte_none(huge_ptep_get(pte)); + for (; addr != end; vec++, addr += PAGE_SIZE) + *vec = present; + walk->private = vec; #else BUG(); #endif + return err; } /* @@ -94,14 +82,15 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) return present; } -static void mincore_unmapped_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_unmapped_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; + unsigned char *vec = walk->private; unsigned long nr = (end - addr) >> PAGE_SHIFT; int i; - if (vma->vm_file) { + if (vma && vma->vm_file) { pgoff_t pgoff; pgoff = linear_page_index(vma, addr); @@ -111,25 +100,38 @@ static void mincore_unmapped_range(struct vm_area_struct *vma, for (i = 0; i < nr; i++) vec[i] = 0; } + walk->private += nr; + return 0; } -static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, + struct mm_walk *walk) { - unsigned long next; spinlock_t *ptl; + struct vm_area_struct *vma = walk->vma; pte_t *ptep; - ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - do { + if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { + memset(walk->private, 1, (end - addr) >> PAGE_SHIFT); + walk->private += (end - addr) >> PAGE_SHIFT; + spin_unlock(ptl); + return 0; + } + + if (pmd_trans_unstable(pmd)) + return 0; + + ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + for (; addr != end; ptep++, addr += PAGE_SIZE) { pte_t pte = *ptep; pgoff_t pgoff; + unsigned char *vec = walk->private; - next = addr + PAGE_SIZE; - if (pte_none(pte)) - mincore_unmapped_range(vma, addr, next, vec); - else if (pte_present(pte)) + if (pte_none(pte)) { + mincore_unmapped_range(addr, addr + PAGE_SIZE, walk); + continue; + } + if
(pte_present(pte)) *vec = 1; else if (pte_file(pte)) { pgoff = pte_to_pgoff(pte); @@ -151,70 +153,11 @@ static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, #endif } } - vec++; - } while (ptep++, addr = next, addr != end); + walk->private++; + } pte_unmap_unlock(ptep - 1, ptl); -} - -static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pmd_t *pmd; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) { - if (mincore_huge_pmd(vma, pmd, addr, next, vec)) { - vec += (next - addr) >> PAGE_SHIFT; - continue; - } - /* fall through */ - } - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pte_range(vma, pmd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pmd++, addr = next, addr != end); -} - -static void mincore_pud_range(struct vm_area_struct *vma, pgd_t *pgd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pud_t *pud; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pmd_range(vma, pud, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pud++, addr = next, addr != end); -} - -static void mincore_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pgd_t *pgd; - - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pud_range(vma, pgd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pgd++, addr = next, addr != end); + cond_resched(); + return 0; } /* @@ -225,20 +168,28 @@ static void mincore_page_range(struct vm_area_struct *vma, 
static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec) { struct vm_area_struct *vma; - unsigned long end; + int err; + struct mm_walk mincore_walk = { + .pmd_entry = mincore_pte_range, + .pte_hole = mincore_unmapped_range, + .hugetlb_entry = mincore_hugetlb, + .private = vec, + }; vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; + mincore_walk.mm = vma->vm_mm; - end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - - if (is_vm_hugetlb_page(vma)) - mincore_hugetlb_page_range(vma, addr, end, vec); - else - mincore_page_range(vma, addr, end, vec); + err = walk_page_vma(vma, &mincore_walk); + if (err < 0) + return err; + else { + unsigned long end; - return (end - addr) >> PAGE_SHIFT; + end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + return (end - addr) >> PAGE_SHIFT; + } } /* -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 0E7896B0031 for ; Tue, 1 Jul 2014 17:00:40 -0400 (EDT) Received: by mail-pa0-f47.google.com with SMTP id kq14so11183907pab.6 for ; Tue, 01 Jul 2014 14:00:40 -0700 (PDT) Received: from mga02.intel.com (mga02.intel.com. 
[134.134.136.20]) by mx.google.com with ESMTP id le9si28087955pab.198.2014.07.01.14.00.39 for ; Tue, 01 Jul 2014 14:00:39 -0700 (PDT) Message-ID: <53B32170.1040707@intel.com> Date: Tue, 01 Jul 2014 14:00:32 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH v4 11/13] mempolicy: apply page table walker on queue_pages_range() References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi , linux-mm@kvack.org Cc: Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi , Jet Chen On 07/01/2014 10:07 AM, Naoya Horiguchi wrote: > queue_pages_range() does page table walking in its own way now, but there > is some code duplicate. This patch applies page table walker to reduce > lines of code. > > queue_pages_range() has to do some precheck to determine whether we really > walk over the vma or just skip it. Now we have test_walk() callback in > mm_walk for this purpose, so we can do this replacement cleanly. > queue_pages_test_walk() depends on not only the current vma but also the > previous one, so queue_pages->prev is introduced to remember it. Hi Naoya, The previous version of this patch caused a performance regression which was reported to you: http://marc.info/?l=linux-kernel&m=140375975525069&w=2 Has that been dealt with in this version somehow? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by kanga.kvack.org (Postfix) with ESMTP id 87AC86B0035 for ; Tue, 1 Jul 2014 17:52:12 -0400 (EDT) Received: by mail-wi0-f173.google.com with SMTP id cc10so8603625wib.6 for ; Tue, 01 Jul 2014 14:52:12 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id v8si29632134wjq.85.2014.07.01.14.52.10 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 01 Jul 2014 14:52:11 -0700 (PDT) Date: Tue, 1 Jul 2014 17:51:56 -0400 From: Naoya Horiguchi Subject: Re: [PATCH v4 11/13] mempolicy: apply page table walker on queue_pages_range() Message-ID: <20140701215156.GA21032@nhori.bos.redhat.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> <53B32170.1040707@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53B32170.1040707@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: linux-mm@kvack.org, Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi , Jet Chen On Tue, Jul 01, 2014 at 02:00:32PM -0700, Dave Hansen wrote: > On 07/01/2014 10:07 AM, Naoya Horiguchi wrote: > > queue_pages_range() does page table walking in its own way now, but there > > is some code duplicate. This patch applies page table walker to reduce > > lines of code. > > > > queue_pages_range() has to do some precheck to determine whether we really > > walk over the vma or just skip it. Now we have test_walk() callback in > > mm_walk for this purpose, so we can do this replacement cleanly. > > queue_pages_test_walk() depends on not only the current vma but also the > > previous one, so queue_pages->prev is introduced to remember it. 
> > Hi Naoya, > > The previous version of this patch caused a performance regression which > was reported to you: > > http://marc.info/?l=linux-kernel&m=140375975525069&w=2 > > Has that been dealt with in this version somehow? I believe so. In the previous version we called the ->pte_entry() callback for each pte entry, but in this version I stopped doing that and most of the work is done in the ->pmd_entry() callback, so the number of function calls is reduced to about 1/512. Other than that, I just cleaned up queue_pages_* without major behavioral changes, so the visible regression should be solved. Thanks, Naoya Horiguchi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f46.google.com (mail-la0-f46.google.com [209.85.215.46]) by kanga.kvack.org (Postfix) with ESMTP id BC7976B0037 for ; Wed, 9 Jul 2014 09:34:52 -0400 (EDT) Received: by mail-la0-f46.google.com with SMTP id el20so4986741lab.5 for ; Wed, 09 Jul 2014 06:34:51 -0700 (PDT) Received: from jenni2.inet.fi (mta-out1.inet.fi. [62.71.2.199]) by mx.google.com with ESMTP id h7si7315006laa.109.2014.07.09.06.34.51 for ; Wed, 09 Jul 2014 06:34:51 -0700 (PDT) Date: Wed, 9 Jul 2014 16:34:36 +0300 From: "Kirill A.
Shutemov" Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140709133436.GA18391@node.dhcp.inet.fi> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > of code by using common page table walk code. > > ChangeLog v4: > - remove redundant vma > > ChangeLog v3: > - add NULL vma check in mincore_unmapped_range() > - don't use pte_entry() > > ChangeLog v2: > - change type of args of callbacks to void * > - move definition of mincore_walk to the start of the function to fix compiler > warning > > Signed-off-by: Naoya Horiguchi Trinity crashes this implementation of mincore pretty easily: [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 [ 42.777560] PGD 2ef6067 PUD 87fa01067 PMD 87f823067 PTE 800000007bb61060 [ 42.778529] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [ 42.779106] Modules linked in: [ 42.779106] CPU: 0 PID: 917 Comm: trinity-c27 Not tainted 3.16.0-rc4-next-20140709-00013-g28e4629f71a8 #1450 [ 42.779106] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 42.779106] task: ffff880852e98110 ti: ffff880844024000 task.ti: ffff880844024000 [ 42.779106] RIP: 0010:[] [] mincore_unmapped_range+0xdf/0x100 [ 42.779106] RSP: 0018:ffff880844027df0 EFLAGS: 00010202 [ 42.779106] RAX: 000000000000001c RBX: 00007fc300000000 RCX: 00003ffffffff000 [ 42.779106] RDX: 000000000000001b
RSI: ffff88007bb60fe5 RDI: 00007fc2c2c00000 [ 42.779106] RBP: ffff880844027e28 R08: 00007fc2c2e00000 R09: 0000000000000000 [ 42.779106] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000200 [ 42.779106] R13: ffff88007bb60fe5 R14: ffff880855a80018 R15: 00007fc2c2c00000 [ 42.779106] FS: 00007fc345666700(0000) GS:ffff880859600000(0000) knlGS:0000000000000000 [ 42.779106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 42.779106] CR2: ffff88007bb61000 CR3: 0000000852dfd000 CR4: 00000000000006f0 [ 42.779106] Stack: [ 42.779106] ffff880844027f10 ffff88007bb60fe5 00007fc300000000 00007fc2c2e00000 [ 42.779106] 00007fc2c1e1b000 ffff880844027f10 00007fc2c2c00000 ffff880844027eb8 [ 42.779106] ffffffff81135bfe 00007fc341c1bfff ffff880000000000 ffff880852dfd7f8 [ 42.779106] Call Trace: [ 42.779106] [] __walk_page_range+0x1ae/0x450 [ 42.779106] [] walk_page_vma+0x71/0x90 [ 42.779106] [] SyS_mincore+0x1de/0x270 [ 42.779106] [] ? trace_hardirqs_on+0xd/0x10 [ 42.779106] [] ? mincore_unmapped_range+0x100/0x100 [ 42.779106] [] ? mincore_page+0xa0/0xa0 [ 42.779106] [] ? handle_mm_fault+0xd30/0xd30 [ 42.779106] [] system_call_fastpath+0x16/0x1b [ 42.779106] Code: 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 40 00 31 d2 31 c0 4d 85 e4 4c 8b 6d d0 74 d3 0f 1f 00 48 8b 75 d0 83 c0 01 04 16 00 48 63 d0 49 39 d4 77 ed eb b3 48 89 fe 4c 89 f7 e8 [ 42.779106] RIP [] mincore_unmapped_range+0xdf/0x100 [ 42.779106] RSP [ 42.779106] CR2: ffff88007bb61000 [ 42.779106] ---[ end trace 3fac62521b6b0cb0 ]--- [ 42.779106] Kernel panic - not syncing: Fatal exception [ 42.779106] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff) Looks like a 'vec' overflow. I don't see what would prevent do_mincore() from writing more than PAGE_SIZE to 'vec'. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f180.google.com (mail-ig0-f180.google.com [209.85.213.180]) by kanga.kvack.org (Postfix) with ESMTP id 0F11B900002 for ; Wed, 9 Jul 2014 17:36:35 -0400 (EDT) Received: by mail-ig0-f180.google.com with SMTP id l13so1990085iga.7 for ; Wed, 09 Jul 2014 14:36:34 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id k4si10024343igx.63.2014.07.09.14.36.33 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jul 2014 14:36:34 -0700 (PDT) Date: Wed, 9 Jul 2014 17:36:24 -0400 From: Naoya Horiguchi Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140709213624.GC24698@nhori> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> <20140709133436.GA18391@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140709133436.GA18391@node.dhcp.inet.fi> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi On Wed, Jul 09, 2014 at 04:34:36PM +0300, Kirill A. Shutemov wrote: > On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > > of code by using common page table walk code. 
> > > > ChangeLog v4: > > - remove redundant vma > > > > ChangeLog v3: > > - add NULL vma check in mincore_unmapped_range() > > - don't use pte_entry() > > > > ChangeLog v2: > > - change type of args of callbacks to void * > > - move definition of mincore_walk to the start of the function to fix compiler > > warning > > > > Signed-off-by: Naoya Horiguchi > > Trinity crases this implementation of mincore pretty easily: > > [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 > [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 Thanks for your testing/reporting. ... > > Looks like 'vec' overflow. I don't see what could prevent do_mincore() to > write more than PAGE_SIZE to 'vec'. I found the miscalculation of walk->private (vec) on thp and hugetlbfs. I confirmed that the reported problem is fixed (I checked that trinity never triggers the reported BUG) with the following changes on this patch. diff --git a/mm/mincore.c b/mm/mincore.c index 3c64dcbcb3e2..9eb10d867a6f 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -34,7 +34,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, present = pte && !huge_pte_none(huge_ptep_get(pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; - walk->private += (end - addr) >> PAGE_SHIFT; + walk->private = vec; #else BUG(); #endif @@ -118,8 +118,10 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd)) { + walk->private += (end - addr) >> PAGE_SHIFT; return 0; + } ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); for (; addr != end; ptep++, addr += PAGE_SIZE) { Thanks, Naoya Horiguchi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id D6B666B0031 for ; Thu, 10 Jul 2014 06:06:15 -0400 (EDT) Received: by mail-wi0-f179.google.com with SMTP id cc10so4175107wib.12 for ; Thu, 10 Jul 2014 03:06:15 -0700 (PDT) Received: from jenni2.inet.fi (mta-out1.inet.fi. [62.71.2.199]) by mx.google.com with ESMTP id j4si61584956wja.141.2014.07.10.03.06.14 for ; Thu, 10 Jul 2014 03:06:14 -0700 (PDT) Date: Thu, 10 Jul 2014 13:06:00 +0300 From: "Kirill A. Shutemov" Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140710100600.GA30360@node.dhcp.inet.fi> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> <20140709133436.GA18391@node.dhcp.inet.fi> <20140709213624.GC24698@nhori> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140709213624.GC24698@nhori> Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi On Wed, Jul 09, 2014 at 05:36:24PM -0400, Naoya Horiguchi wrote: > On Wed, Jul 09, 2014 at 04:34:36PM +0300, Kirill A. Shutemov wrote: > > On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > > > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > > > of code by using common page table walk code. 
> > > > > > ChangeLog v4: > > > - remove redundant vma > > > > > > ChangeLog v3: > > > - add NULL vma check in mincore_unmapped_range() > > > - don't use pte_entry() > > > > > > ChangeLog v2: > > > - change type of args of callbacks to void * > > > - move definition of mincore_walk to the start of the function to fix compiler > > > warning > > > > > > Signed-off-by: Naoya Horiguchi > > > > Trinity crases this implementation of mincore pretty easily: > > > > [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 > > [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 > > Thanks for your testing/reporting. > > ... > > > > Looks like 'vec' overflow. I don't see what could prevent do_mincore() to > > write more than PAGE_SIZE to 'vec'. > > I found the miscalculation of walk->private (vec) on thp and hugetlbfs. > I confirmed that the reported problem is fixed (I checked that trinity > never triggers the reported BUG) with the following changes on this patch. With the changes: [ 26.850945] BUG: unable to handle kernel paging request at ffff880852d8c000 [ 26.852718] IP: [] mincore_hugetlb+0x27/0x50 [ 26.853527] PGD 2ef6067 PUD 2ef9067 PMD 87fd4a067 PTE 8000000852d8c060 [ 26.854462] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [ 26.854752] Modules linked in: [ 26.854752] CPU: 5 PID: 170 Comm: trinity-c5 Not tainted 3.16.0-rc4-next-20140709-00013-g28e4629f71a8-dirty #1453 [ 26.854752] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 26.854752] task: ffff880852d22890 ti: ffff880852d24000 task.ti: ffff880852d24000 [ 26.854752] RIP: 0010:[] [] mincore_hugetlb+0x27/0x50 [ 26.854752] RSP: 0018:ffff880852d27e28 EFLAGS: 00010206 [ 26.854752] RAX: ffff880852d8c000 RBX: 00007f9fb2200000 RCX: 00007f9fb2200000 [ 26.854752] RDX: 00007f9fb2001000 RSI: ffffffffffe00000 RDI: ffff88084f3edc80 [ 26.854752] RBP: ffff880852d27e28 R08: ffff880852d27f10 R09: ffffffff81126dc0 [ 26.854752] R10: 0000000000000000 R11: 0000000000000001 R12: 00007f9fde000000 [ 26.854752] R13: 
ffffffff82e32580 R14: 00007f9fb2000000 R15: ffff880852d27f10 [ 26.854752] FS: 00007f9fe1bde700(0000) GS:ffff88085a000000(0000) knlGS:0000000000000000 [ 26.854752] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 26.854752] CR2: ffff880852d8c000 CR3: 0000000852d12000 CR4: 00000000000006e0 [ 26.854752] Stack: [ 26.854752] ffff880852d27eb8 ffffffff81135e24 ffff880852ce01d8 0000000000000282 [ 26.854752] ffff880852ce01d8 ffff880852d22890 00007f9fde000000 ffff880852d27eb0 [ 26.854752] 0000000000000282 0000000000000000 ffffffff81127399 0000000000000282 [ 26.854752] Call Trace: [ 26.854752] [] __walk_page_range+0x3f4/0x450 [ 26.854752] [] ? SyS_mincore+0x179/0x270 [ 26.854752] [] walk_page_vma+0x71/0x90 [ 26.854752] [] SyS_mincore+0x1de/0x270 [ 26.854752] [] ? mincore_unmapped_range+0x100/0x100 [ 26.854752] [] ? mincore_page+0xa0/0xa0 [ 26.854752] [] ? handle_mm_fault+0xd30/0xd30 [ 26.854752] [] system_call_fastpath+0x16/0x1b [ 26.854752] Code: 0f 1f 40 00 55 48 85 ff 49 8b 40 38 48 89 e5 74 33 48 83 3f 00 40 0f 95 c6 48 39 ca 74 19 66 0f 1f 44 00 00 48 81 c2 00 10 00 00 <40> 88 30 48 83 c0 01 48 39 d1 75 ed 49 89 40 38 31 c0 5d c3 0f [ 26.854752] RIP [] mincore_hugetlb+0x27/0x50 [ 26.854752] RSP [ 26.854752] CR2: ffff880852d8c000 [ 26.854752] ---[ end trace 536bbdef8c6d5b03 ]--- Could you explain to me how you protect 'vec' from overflowing? I don't see any code for that. -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f44.google.com (mail-wg0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id 5A9336B0031 for ; Thu, 10 Jul 2014 07:32:33 -0400 (EDT) Received: by mail-wg0-f44.google.com with SMTP id m15so310913wgh.3 for ; Thu, 10 Jul 2014 04:32:32 -0700 (PDT) Received: from jenni1.inet.fi (mta-out1.inet.fi.
[62.71.2.198]) by mx.google.com with ESMTP id wr4si32622662wjb.15.2014.07.10.04.32.32 for ; Thu, 10 Jul 2014 04:32:32 -0700 (PDT) Date: Thu, 10 Jul 2014 14:32:19 +0300 From: "Kirill A. Shutemov" Subject: Re: [PATCH v4 05/13] clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() Message-ID: <20140710113219.GA30954@node.dhcp.inet.fi> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi On Tue, Jul 01, 2014 at 01:07:23PM -0400, Naoya Horiguchi wrote: > @@ -822,38 +844,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > }; > struct mm_walk clear_refs_walk = { > .pmd_entry = clear_refs_pte_range, > + .test_walk = clear_refs_test_walk, > .mm = mm, > .private = &cp, > }; > down_read(&mm->mmap_sem); > if (type == CLEAR_REFS_SOFT_DIRTY) > mmu_notifier_invalidate_range_start(mm, 0, -1); > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > - cp.vma = vma; > - if (is_vm_hugetlb_page(vma)) > - continue; > - /* > - * Writing 1 to /proc/pid/clear_refs affects all pages. > - * > - * Writing 2 to /proc/pid/clear_refs only affects > - * Anonymous pages. > - * > - * Writing 3 to /proc/pid/clear_refs only affects file > - * mapped pages. > - * > - * Writing 4 to /proc/pid/clear_refs affects all pages. 
> - */ > - if (type == CLEAR_REFS_ANON && vma->vm_file) > - continue; > - if (type == CLEAR_REFS_MAPPED && !vma->vm_file) > - continue; > - if (type == CLEAR_REFS_SOFT_DIRTY) { > - if (vma->vm_flags & VM_SOFTDIRTY) > - vma->vm_flags &= ~VM_SOFTDIRTY; > - } > - walk_page_range(vma->vm_start, vma->vm_end, > - &clear_refs_walk); > - } > + walk_page_range(0, ~0UL, &clear_refs_walk); 'vma' variable is now unused in the clear_refs_write(). -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f51.google.com (mail-qa0-f51.google.com [209.85.216.51]) by kanga.kvack.org (Postfix) with ESMTP id 9B0506B0031 for ; Thu, 10 Jul 2014 09:28:11 -0400 (EDT) Received: by mail-qa0-f51.google.com with SMTP id k15so225555qaq.38 for ; Thu, 10 Jul 2014 06:28:11 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id c93si64620126qgf.89.2014.07.10.06.28.09 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 10 Jul 2014 06:28:10 -0700 (PDT) Date: Thu, 10 Jul 2014 09:27:17 -0400 From: Naoya Horiguchi Subject: Re: [PATCH v4 05/13] clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() Message-ID: <20140710132717.GA12391@nhori> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> <20140710113219.GA30954@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140710113219.GA30954@node.dhcp.inet.fi> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. 
Shutemov" Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi On Thu, Jul 10, 2014 at 02:32:19PM +0300, Kirill A. Shutemov wrote: > On Tue, Jul 01, 2014 at 01:07:23PM -0400, Naoya Horiguchi wrote: > > @@ -822,38 +844,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > > }; > > struct mm_walk clear_refs_walk = { > > .pmd_entry = clear_refs_pte_range, > > + .test_walk = clear_refs_test_walk, > > .mm = mm, > > .private = &cp, > > }; > > down_read(&mm->mmap_sem); > > if (type == CLEAR_REFS_SOFT_DIRTY) > > mmu_notifier_invalidate_range_start(mm, 0, -1); > > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > > - cp.vma = vma; > > - if (is_vm_hugetlb_page(vma)) > > - continue; > > - /* > > - * Writing 1 to /proc/pid/clear_refs affects all pages. > > - * > > - * Writing 2 to /proc/pid/clear_refs only affects > > - * Anonymous pages. > > - * > > - * Writing 3 to /proc/pid/clear_refs only affects file > > - * mapped pages. > > - * > > - * Writing 4 to /proc/pid/clear_refs affects all pages. > > - */ > > - if (type == CLEAR_REFS_ANON && vma->vm_file) > > - continue; > > - if (type == CLEAR_REFS_MAPPED && !vma->vm_file) > > - continue; > > - if (type == CLEAR_REFS_SOFT_DIRTY) { > > - if (vma->vm_flags & VM_SOFTDIRTY) > > - vma->vm_flags &= ~VM_SOFTDIRTY; > > - } > > - walk_page_range(vma->vm_start, vma->vm_end, > > - &clear_refs_walk); > > - } > > + walk_page_range(0, ~0UL, &clear_refs_walk); > > 'vma' variable is now unused in the clear_refs_write(). Yes, will remove it. Naoya -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vc0-f181.google.com (mail-vc0-f181.google.com [209.85.220.181]) by kanga.kvack.org (Postfix) with ESMTP id 230076B0036 for ; Thu, 10 Jul 2014 12:36:11 -0400 (EDT) Received: by mail-vc0-f181.google.com with SMTP id il7so11141300vcb.12 for ; Thu, 10 Jul 2014 09:36:10 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id dx8si23151864vdb.24.2014.07.10.09.36.09 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 10 Jul 2014 09:36:10 -0700 (PDT) Date: Thu, 10 Jul 2014 12:35:55 -0400 From: Naoya Horiguchi Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140710163555.GB12391@nhori> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> <20140709133436.GA18391@node.dhcp.inet.fi> <20140709213624.GC24698@nhori> <20140710100600.GA30360@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140710100600.GA30360@node.dhcp.inet.fi> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi On Thu, Jul 10, 2014 at 01:06:00PM +0300, Kirill A. Shutemov wrote: > On Wed, Jul 09, 2014 at 05:36:24PM -0400, Naoya Horiguchi wrote: > > On Wed, Jul 09, 2014 at 04:34:36PM +0300, Kirill A. Shutemov wrote: > > > On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > > > > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > > > > of code by using common page table walk code. 
> > > > > > > > ChangeLog v4: > > > > - remove redundant vma > > > > > > > > ChangeLog v3: > > > > - add NULL vma check in mincore_unmapped_range() > > > > - don't use pte_entry() > > > > > > > > ChangeLog v2: > > > > - change type of args of callbacks to void * > > > > - move definition of mincore_walk to the start of the function to fix compiler > > > > warning > > > > > > > > Signed-off-by: Naoya Horiguchi > > > > > > Trinity crases this implementation of mincore pretty easily: > > > > > > [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 > > > [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 > > > > Thanks for your testing/reporting. > > > > ... > > > > > > Looks like 'vec' overflow. I don't see what could prevent do_mincore() to > > > write more than PAGE_SIZE to 'vec'. > > > > I found the miscalculation of walk->private (vec) on thp and hugetlbfs. > > I confirmed that the reported problem is fixed (I checked that trinity > > never triggers the reported BUG) with the following changes on this patch. 
> > With the changes: > > [ 26.850945] BUG: unable to handle kernel paging request at ffff880852d8c000 > [ 26.852718] IP: [] mincore_hugetlb+0x27/0x50 > [ 26.853527] PGD 2ef6067 PUD 2ef9067 PMD 87fd4a067 PTE 8000000852d8c060 > [ 26.854462] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC > [ 26.854752] Modules linked in: > [ 26.854752] CPU: 5 PID: 170 Comm: trinity-c5 Not tainted 3.16.0-rc4-next-20140709-00013-g28e4629f71a8-dirty #1453 > [ 26.854752] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 26.854752] task: ffff880852d22890 ti: ffff880852d24000 task.ti: ffff880852d24000 > [ 26.854752] RIP: 0010:[] [] mincore_hugetlb+0x27/0x50 > [ 26.854752] RSP: 0018:ffff880852d27e28 EFLAGS: 00010206 > [ 26.854752] RAX: ffff880852d8c000 RBX: 00007f9fb2200000 RCX: 00007f9fb2200000 > [ 26.854752] RDX: 00007f9fb2001000 RSI: ffffffffffe00000 RDI: ffff88084f3edc80 > [ 26.854752] RBP: ffff880852d27e28 R08: ffff880852d27f10 R09: ffffffff81126dc0 > [ 26.854752] R10: 0000000000000000 R11: 0000000000000001 R12: 00007f9fde000000 > [ 26.854752] R13: ffffffff82e32580 R14: 00007f9fb2000000 R15: ffff880852d27f10 > [ 26.854752] FS: 00007f9fe1bde700(0000) GS:ffff88085a000000(0000) knlGS:0000000000000000 > [ 26.854752] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 26.854752] CR2: ffff880852d8c000 CR3: 0000000852d12000 CR4: 00000000000006e0 > [ 26.854752] Stack: > [ 26.854752] ffff880852d27eb8 ffffffff81135e24 ffff880852ce01d8 0000000000000282 > [ 26.854752] ffff880852ce01d8 ffff880852d22890 00007f9fde000000 ffff880852d27eb0 > [ 26.854752] 0000000000000282 0000000000000000 ffffffff81127399 0000000000000282 > [ 26.854752] Call Trace: > [ 26.854752] [] __walk_page_range+0x3f4/0x450 > [ 26.854752] [] ? SyS_mincore+0x179/0x270 > [ 26.854752] [] walk_page_vma+0x71/0x90 > [ 26.854752] [] SyS_mincore+0x1de/0x270 > [ 26.854752] [] ? mincore_unmapped_range+0x100/0x100 > [ 26.854752] [] ? mincore_page+0xa0/0xa0 > [ 26.854752] [] ? 
handle_mm_fault+0xd30/0xd30 > [ 26.854752] [] system_call_fastpath+0x16/0x1b > [ 26.854752] Code: 0f 1f 40 00 55 48 85 ff 49 8b 40 38 48 89 e5 74 33 48 83 3f 00 40 0f 95 c6 48 39 ca 74 19 66 0f 1f 44 00 00 48 81 c2 00 10 00 00 <40> 88 30 48 83 c0 01 48 39 d1 75 ed 49 89 40 38 31 c0 5d c3 0f > [ 26.854752] RIP [] mincore_hugetlb+0x27/0x50 > [ 26.854752] RSP > [ 26.854752] CR2: ffff880852d8c000 > [ 26.854752] ---[ end trace 536bbdef8c6d5b03 ]--- > > Could you explain to me how you protect 'vec' from being overflowed? I don't > any code for that. I don't do it explicitly, so adding an explicit check is one solution. But I think the problem comes from using walk_page_vma() instead of walk_page_range(): walk_page_vma() forcibly sets the walk range to [vma->vm_start, vma->vm_end), regardless of addr and pages. Limiting the range to [addr, addr + (pages << PAGE_SHIFT)), clamped to vma->vm_end as the original code does, is fine because it implicitly prevents the buffer overflow. Here is the revised fix for this patch. Please drop the one I sent yesterday because it was wrong.
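The range argument above can be illustrated with a small user-space sketch. This is not kernel code: the SKETCH_* constants and the vec_bytes_* helpers are hypothetical names invented for this illustration, and the only assumption taken from the thread is that one status byte is written to 'vec' per page walked, with 'vec' holding no more than PAGE_SIZE bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* One status byte goes into vec for every page in [start, end). */
static size_t vec_bytes_for_range(unsigned long start, unsigned long end)
{
	assert(start <= end);
	return (end - start) >> SKETCH_PAGE_SHIFT;
}

/*
 * Walking the whole vma ignores addr and pages, so the number of
 * bytes written depends only on the size of the vma.
 */
static size_t vec_bytes_whole_vma(unsigned long vm_start, unsigned long vm_end)
{
	return vec_bytes_for_range(vm_start, vm_end);
}

/*
 * Walking only the requested window, clamped to the end of the vma,
 * never writes more than 'pages' bytes.
 */
static size_t vec_bytes_clamped(unsigned long vm_end, unsigned long addr,
				unsigned long pages)
{
	unsigned long end = addr + (pages << SKETCH_PAGE_SHIFT);

	if (end > vm_end)
		end = vm_end;
	return vec_bytes_for_range(addr, end);
}
```

With a vma of 8192 pages and a request for 4096 pages starting at vm_start, the whole-vma walk would write 8192 bytes while the clamped walk writes 4096, which is the bound the caller sized 'vec' for.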
Thanks, Naoya Horiguchi --- diff --git a/mm/mincore.c b/mm/mincore.c index 3c64dcbcb3e2..0e548fbce19e 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -34,7 +34,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, present = pte && !huge_pte_none(huge_ptep_get(pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; - walk->private += (end - addr) >> PAGE_SHIFT; + walk->private = vec; #else BUG(); #endif @@ -118,8 +118,10 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd)) { + mincore_unmapped_range(addr, end, walk); return 0; + } ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); for (; addr != end; ptep++, addr += PAGE_SIZE) { @@ -168,6 +170,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec) { struct vm_area_struct *vma; + unsigned long end; int err; struct mm_walk mincore_walk = { .pmd_entry = mincore_pte_range, @@ -180,16 +183,11 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v if (!vma || addr < vma->vm_start) return -ENOMEM; mincore_walk.mm = vma->vm_mm; - - err = walk_page_vma(vma, &mincore_walk); + end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; - else { - unsigned long end; - - end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - return (end - addr) >> PAGE_SHIFT; - } + return (end - addr) >> PAGE_SHIFT; } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
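For illustration only, the range clamping that the revised do_mincore() relies on can be sketched in user space. This is a minimal sketch, not kernel code: struct toy_vma, min_ul() and toy_mincore() are hypothetical names, and PAGE_SHIFT is assumed to be 12. The point is that the walk end is the smaller of the vma end and addr + (pages << PAGE_SHIFT), so the number of bytes written to vec can never exceed the caller's page count.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12UL			/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in for the vm_start/vm_end pair of a vma. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Write one byte of vec per page in [addr, end), where end is clamped to
 * both the vma and the caller's page count. Returns the number of bytes
 * written, or -1 if addr lies outside the vma.
 */
static long toy_mincore(const struct toy_vma *vma, unsigned long addr,
			unsigned long pages, unsigned char *vec)
{
	unsigned long end;
	unsigned char *p = vec;

	/* The sketch only handles page-aligned start addresses. */
	assert((addr & (PAGE_SIZE - 1)) == 0);

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return -1;

	/* Walking up to vm_end unconditionally is what overflowed vec. */
	end = min_ul(vma->vm_end, addr + (pages << PAGE_SHIFT));
	for (; addr < end; addr += PAGE_SIZE)
		*p++ = 1;
	return (long)(p - vec);
}
```

With a vma covering [0x200000, 0x400000), a request for 4 pages starting at 0x3fe000 writes only 2 bytes, and a request for 3 pages starting at 0x200000 writes exactly 3, however large the vma is.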
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758642AbaGARIB (ORCPT ); Tue, 1 Jul 2014 13:08:01 -0400 Received: from mx1.redhat.com ([209.132.183.28]:3799 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758555AbaGARHz (ORCPT ); Tue, 1 Jul 2014 13:07:55 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A.
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 05/13] clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() Date: Tue, 1 Jul 2014 13:07:23 -0400 Message-Id: <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org clear_refs_write() has some prechecks to determine if we really walk over a given vma. Now we have a test_walk() callback to filter vmas, so let's utilize it. ChangeLog v4: - use walk_page_range instead of walk_page_vma with for loop Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 54 ++++++++++++++++++++++++++---------------------------- 1 file changed, 26 insertions(+), 28 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 3067bf08393b..df9f368e01b7 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -715,7 +715,6 @@ enum clear_refs_types { }; struct clear_refs_private { - struct vm_area_struct *vma; enum clear_refs_types type; }; @@ -748,7 +747,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct clear_refs_private *cp = walk->private; - struct vm_area_struct *vma = cp->vma; + struct vm_area_struct *vma = walk->vma; pte_t *pte, ptent; spinlock_t *ptl; struct page *page; @@ -782,6 +781,29 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, return 0; } +static int clear_refs_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct clear_refs_private *cp = walk->private; + struct vm_area_struct *vma = walk->vma; + + /* + * Writing 1 to /proc/pid/clear_refs affects all pages. 
+ * Writing 2 to /proc/pid/clear_refs only affects anonymous pages. + * Writing 3 to /proc/pid/clear_refs only affects file mapped pages. + * Writing 4 to /proc/pid/clear_refs affects all pages. + */ + if (cp->type == CLEAR_REFS_ANON && vma->vm_file) + return 1; + if (cp->type == CLEAR_REFS_MAPPED && !vma->vm_file) + return 1; + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + if (vma->vm_flags & VM_SOFTDIRTY) + vma->vm_flags &= ~VM_SOFTDIRTY; + } + return 0; +} + static ssize_t clear_refs_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { @@ -822,38 +844,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, }; struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range, + .test_walk = clear_refs_test_walk, .mm = mm, .private = &cp, }; down_read(&mm->mmap_sem); if (type == CLEAR_REFS_SOFT_DIRTY) mmu_notifier_invalidate_range_start(mm, 0, -1); - for (vma = mm->mmap; vma; vma = vma->vm_next) { - cp.vma = vma; - if (is_vm_hugetlb_page(vma)) - continue; - /* - * Writing 1 to /proc/pid/clear_refs affects all pages. - * - * Writing 2 to /proc/pid/clear_refs only affects - * Anonymous pages. - * - * Writing 3 to /proc/pid/clear_refs only affects file - * mapped pages. - * - * Writing 4 to /proc/pid/clear_refs affects all pages. 
- */ - if (type == CLEAR_REFS_ANON && vma->vm_file) - continue; - if (type == CLEAR_REFS_MAPPED && !vma->vm_file) - continue; - if (type == CLEAR_REFS_SOFT_DIRTY) { - if (vma->vm_flags & VM_SOFTDIRTY) - vma->vm_flags &= ~VM_SOFTDIRTY; - } - walk_page_range(vma->vm_start, vma->vm_end, - &clear_refs_walk); - } + walk_page_range(0, ~0UL, &clear_refs_walk); if (type == CLEAR_REFS_SOFT_DIRTY) mmu_notifier_invalidate_range_end(mm, 0, -1); flush_tlb_mm(mm); -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932417AbaGARIK (ORCPT ); Tue, 1 Jul 2014 13:08:10 -0400 Received: from mx1.redhat.com ([209.132.183.28]:46092 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932388AbaGARIH (ORCPT ); Tue, 1 Jul 2014 13:08:07 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Date: Tue, 1 Jul 2014 13:07:31 -0400 Message-Id: <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch makes do_mincore() use walk_page_vma(), which reduces many lines of code by using common page table walk code. 
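As a rough illustration of the pattern this conversion depends on, the sketch below shows a generic range walker that calls a per-chunk callback and lets the callback carry an output cursor in a private pointer. It is a user-space sketch under stated assumptions, not the kernel's walker: struct toy_walk, toy_walk_range(), toy_mark_present() and CHUNK_SIZE (a stand-in for PMD_SIZE) are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12UL			/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define CHUNK_SIZE (4 * PAGE_SIZE)	/* hypothetical stand-in for PMD_SIZE */

struct toy_walk {
	/* Called once per chunk; a non-zero return stops the walk. */
	int (*chunk_entry)(unsigned long addr, unsigned long end,
			   struct toy_walk *walk);
	void *private;			/* caller-owned cursor */
};

/* Split [addr, end) at CHUNK_SIZE boundaries and call the callback. */
static int toy_walk_range(unsigned long addr, unsigned long end,
			  struct toy_walk *walk)
{
	assert(walk->chunk_entry != NULL);

	while (addr < end) {
		unsigned long next = (addr + CHUNK_SIZE) & ~(CHUNK_SIZE - 1);
		int err;

		if (next > end)
			next = end;
		err = walk->chunk_entry(addr, next, walk);
		if (err)
			return err;
		addr = next;
	}
	return 0;
}

/* Mark every page in [addr, end) as present and advance the cursor. */
static int toy_mark_present(unsigned long addr, unsigned long end,
			    struct toy_walk *walk)
{
	unsigned char *vec = walk->private;

	for (; addr != end; vec++, addr += PAGE_SIZE)
		*vec = 1;
	/*
	 * Store the advanced pointer itself. After the loop addr == end,
	 * so adding (end - addr) >> PAGE_SHIFT here would add zero.
	 */
	walk->private = vec;
	return 0;
}
```

Because the walker may call the callback several times for one request, each callback has to leave the cursor pointing just past the bytes it wrote. Otherwise the next chunk overwrites earlier results.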
ChangeLog v4: - remove redundant vma ChangeLog v3: - add NULL vma check in mincore_unmapped_range() - don't use pte_entry() ChangeLog v2: - change type of args of callbacks to void * - move definition of mincore_walk to the start of the function to fix compiler warning Signed-off-by: Naoya Horiguchi --- mm/huge_memory.c | 20 ------- mm/mincore.c | 173 ++++++++++++++++++++----------------------------------- 2 files changed, 62 insertions(+), 131 deletions(-) diff --git v3.16-rc3.orig/mm/huge_memory.c v3.16-rc3/mm/huge_memory.c index 33514d88fef9..63bed13c6cf5 100644 --- v3.16-rc3.orig/mm/huge_memory.c +++ v3.16-rc3/mm/huge_memory.c @@ -1410,26 +1410,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } -int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - spinlock_t *ptl; - int ret = 0; - - if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - /* - * All logical pages in the range are present - * if backed by a huge page. 
- */ - spin_unlock(ptl); - memset(vec, 1, (end - addr) >> PAGE_SHIFT); - ret = 1; - } - - return ret; -} - int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, unsigned long old_addr, unsigned long new_addr, unsigned long old_end, diff --git v3.16-rc3.orig/mm/mincore.c v3.16-rc3/mm/mincore.c index 725c80961048..3c64dcbcb3e2 100644 --- v3.16-rc3.orig/mm/mincore.c +++ v3.16-rc3/mm/mincore.c @@ -19,38 +19,26 @@ #include #include -static void mincore_hugetlb_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, + unsigned long end, struct mm_walk *walk) { + int err = 0; #ifdef CONFIG_HUGETLB_PAGE - struct hstate *h; + unsigned char present; + unsigned char *vec = walk->private; - h = hstate_vma(vma); - while (1) { - unsigned char present; - pte_t *ptep; - /* - * Huge pages are always in RAM for now, but - * theoretically it needs to be checked. - */ - ptep = huge_pte_offset(current->mm, - addr & huge_page_mask(h)); - present = ptep && !huge_pte_none(huge_ptep_get(ptep)); - while (1) { - *vec = present; - vec++; - addr += PAGE_SIZE; - if (addr == end) - return; - /* check hugepage border */ - if (!(addr & ~huge_page_mask(h))) - break; - } - } + /* + * Hugepages under user process are always in RAM and never + * swapped out, but theoretically it needs to be checked. 
+ */ + present = pte && !huge_pte_none(huge_ptep_get(pte)); + for (; addr != end; vec++, addr += PAGE_SIZE) + *vec = present; + walk->private += (end - addr) >> PAGE_SHIFT; #else BUG(); #endif + return err; } /* @@ -94,14 +82,15 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) return present; } -static void mincore_unmapped_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_unmapped_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; + unsigned char *vec = walk->private; unsigned long nr = (end - addr) >> PAGE_SHIFT; int i; - if (vma->vm_file) { + if (vma && vma->vm_file) { pgoff_t pgoff; pgoff = linear_page_index(vma, addr); @@ -111,25 +100,38 @@ static void mincore_unmapped_range(struct vm_area_struct *vma, for (i = 0; i < nr; i++) vec[i] = 0; } + walk->private += nr; + return 0; } -static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, + struct mm_walk *walk) { - unsigned long next; spinlock_t *ptl; + struct vm_area_struct *vma = walk->vma; pte_t *ptep; - ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - do { + if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { + memset(walk->private, 1, (end - addr) >> PAGE_SHIFT); + walk->private += (end - addr) >> PAGE_SHIFT; + spin_unlock(ptl); + return 0; + } + + if (pmd_trans_unstable(pmd)) + return 0; + + ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + for (; addr != end; ptep++, addr += PAGE_SIZE) { pte_t pte = *ptep; pgoff_t pgoff; + unsigned char *vec = walk->private; - next = addr + PAGE_SIZE; - if (pte_none(pte)) - mincore_unmapped_range(vma, addr, next, vec); - else if (pte_present(pte)) + if (pte_none(pte)) { + mincore_unmapped_range(addr, addr + PAGE_SIZE, walk); + continue; + } + if 
(pte_present(pte)) *vec = 1; else if (pte_file(pte)) { pgoff = pte_to_pgoff(pte); @@ -151,70 +153,11 @@ static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, #endif } } - vec++; - } while (ptep++, addr = next, addr != end); + walk->private++; + } pte_unmap_unlock(ptep - 1, ptl); -} - -static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pmd_t *pmd; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) { - if (mincore_huge_pmd(vma, pmd, addr, next, vec)) { - vec += (next - addr) >> PAGE_SHIFT; - continue; - } - /* fall through */ - } - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pte_range(vma, pmd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pmd++, addr = next, addr != end); -} - -static void mincore_pud_range(struct vm_area_struct *vma, pgd_t *pgd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pud_t *pud; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pmd_range(vma, pud, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pud++, addr = next, addr != end); -} - -static void mincore_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pgd_t *pgd; - - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pud_range(vma, pgd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pgd++, addr = next, addr != end); + cond_resched(); + return 0; } /* @@ -225,20 +168,28 @@ static void mincore_page_range(struct vm_area_struct *vma, 
static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec) { struct vm_area_struct *vma; - unsigned long end; + int err; + struct mm_walk mincore_walk = { + .pmd_entry = mincore_pte_range, + .pte_hole = mincore_unmapped_range, + .hugetlb_entry = mincore_hugetlb, + .private = vec, + }; vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; + mincore_walk.mm = vma->vm_mm; - end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - - if (is_vm_hugetlb_page(vma)) - mincore_hugetlb_page_range(vma, addr, end, vec); - else - mincore_page_range(vma, addr, end, vec); + err = walk_page_vma(vma, &mincore_walk); + if (err < 0) + return err; + else { + unsigned long end; - return (end - addr) >> PAGE_SHIFT; + end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + return (end - addr) >> PAGE_SHIFT; + } } /* -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964800AbaGARII (ORCPT ); Tue, 1 Jul 2014 13:08:08 -0400 Received: from mx1.redhat.com ([209.132.183.28]:36322 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758645AbaGARIC (ORCPT ); Tue, 1 Jul 2014 13:08:02 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 10/13] arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() Date: Tue, 1 Jul 2014 13:07:28 -0400 Message-Id: <1404234451-21695-11-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org We don't have to use mm_walk->private to pass vma to the callback function because of mm_walk->vma. 
And walk_page_vma() is useful if we walk over a single vma. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- arch/powerpc/mm/subpage-prot.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git v3.16-rc3.orig/arch/powerpc/mm/subpage-prot.c v3.16-rc3/arch/powerpc/mm/subpage-prot.c index 6c0b1f5f8d2c..fa9fb5b4c66c 100644 --- v3.16-rc3.orig/arch/powerpc/mm/subpage-prot.c +++ v3.16-rc3/arch/powerpc/mm/subpage-prot.c @@ -134,7 +134,7 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len) static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct vm_area_struct *vma = walk->private; + struct vm_area_struct *vma = walk->vma; split_huge_page_pmd(vma, addr, pmd); return 0; } @@ -163,9 +163,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, if (vma->vm_start >= (addr + len)) break; vma->vm_flags |= VM_NOHUGEPAGE; - subpage_proto_walk.private = vma; - walk_page_range(vma->vm_start, vma->vm_end, - &subpage_proto_walk); + walk_page_vma(vma, &subpage_proto_walk); vma = vma->vm_next; } } -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932385AbaGARIG (ORCPT ); Tue, 1 Jul 2014 13:08:06 -0400 Received: from mx1.redhat.com ([209.132.183.28]:59040 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758555AbaGARIC (ORCPT ); Tue, 1 Jul 2014 13:08:02 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. 
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 08/13] numa_maps: remove numa_maps->vma Date: Tue, 1 Jul 2014 13:07:26 -0400 Message-Id: <1404234451-21695-9-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_numa_map() walks pages on vma basis, so using walk_page_vma() is preferable. ChangeLog v4: - remove redundant vma Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 0d3d1ac32b2e..4ca28f401bb1 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -1245,7 +1245,6 @@ const struct file_operations proc_pagemap_operations = { #ifdef CONFIG_NUMA struct numa_maps { - struct vm_area_struct *vma; unsigned long pages; unsigned long anon; unsigned long active; @@ -1314,18 +1313,17 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, static int gather_pte_stats(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct numa_maps *md; + struct numa_maps *md = walk->private; + struct vm_area_struct *vma = walk->vma; spinlock_t *ptl; pte_t *orig_pte; pte_t *pte; - md = walk->private; - - if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) { + if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { pte_t huge_pte = *(pte_t *)pmd; struct page *page; - page = can_gather_numa_stats(huge_pte, md->vma, addr); + page = can_gather_numa_stats(huge_pte, vma, addr); if (page) gather_stats(page, md, pte_dirty(huge_pte), HPAGE_PMD_SIZE/PAGE_SIZE); @@ 
-1337,7 +1335,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, return 0; orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); do { - struct page *page = can_gather_numa_stats(*pte, md->vma, addr); + struct page *page = can_gather_numa_stats(*pte, vma, addr); if (!page) continue; gather_stats(page, md, pte_dirty(*pte), 1); @@ -1385,7 +1383,12 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) struct file *file = vma->vm_file; struct task_struct *task = proc_priv->task; struct mm_struct *mm = vma->vm_mm; - struct mm_walk walk = {}; + struct mm_walk walk = { + .hugetlb_entry = gather_hugetlb_stats, + .pmd_entry = gather_pte_stats, + .private = md, + .mm = mm, + }; struct mempolicy *pol; char buffer[64]; int nid; @@ -1396,13 +1399,6 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) /* Ensure we start with an empty set of numa_maps statistics. */ memset(md, 0, sizeof(*md)); - md->vma = vma; - - walk.hugetlb_entry = gather_hugetlb_stats; - walk.pmd_entry = gather_pte_stats; - walk.private = md; - walk.mm = mm; - pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol); mpol_cond_put(pol); @@ -1432,7 +1428,8 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) if (is_vm_hugetlb_page(vma)) seq_puts(m, " huge"); - walk_page_range(vma->vm_start, vma->vm_end, &walk); + /* mmap_sem is held by m_start */ + walk_page_vma(vma, &walk); if (!md->pages) goto out; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758741AbaGARJe (ORCPT ); Tue, 1 Jul 2014 13:09:34 -0400 Received: from mx1.redhat.com ([209.132.183.28]:51215 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932387AbaGARIG (ORCPT ); Tue, 1 Jul 2014 13:08:06 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. 
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi , "Kirill A. Shutemov" , Pavel Emelyanov , Andrea Arcangeli , Cyrill Gorcunov Subject: [PATCH v4 12/13] mm: /proc/pid/clear_refs: avoid split_huge_page() Date: Tue, 1 Jul 2014 13:07:30 -0400 Message-Id: <1404234451-21695-13-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: "Kirill A. Shutemov" Currently pagewalker splits all THP pages on any clear_refs request. It's not necessary. We can handle this on PMD level. One side effect is that soft dirty will potentially see more dirty memory, since we will mark whole THP page dirty at once. Sanity checked with CRIU test suite. More testing is required. ChangeLog: - move code for thp to clear_refs_pte_range() Signed-off-by: Kirill A. Shutemov Cc: Pavel Emelyanov Cc: Andrea Arcangeli Cc: Dave Hansen Signed-off-by: Naoya Horiguchi Cc: Cyrill Gorcunov Signed-off-by: Andrew Morton --- fs/proc/task_mmu.c | 47 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 44 insertions(+), 3 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 4ca28f401bb1..50518282ca2e 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -718,10 +718,10 @@ struct clear_refs_private { enum clear_refs_types type; }; +#ifdef CONFIG_MEM_SOFT_DIRTY static inline void clear_soft_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *pte) { -#ifdef CONFIG_MEM_SOFT_DIRTY /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. 
See the @@ -740,9 +740,35 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } set_pte_at(vma->vm_mm, addr, pte, ptent); -#endif } +static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ + pmd_t pmd = *pmdp; + + pmd = pmd_wrprotect(pmd); + pmd = pmd_clear_flags(pmd, _PAGE_SOFT_DIRTY); + + if (vma->vm_flags & VM_SOFTDIRTY) + vma->vm_flags &= ~VM_SOFTDIRTY; + + set_pmd_at(vma->vm_mm, addr, pmdp, pmd); +} + +#else + +static inline void clear_soft_dirty(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte) +{ +} + +static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp) +{ +} +#endif + static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -752,7 +778,22 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr, spinlock_t *ptl; struct page *page; - split_huge_page_pmd(vma, addr, pmd); + if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { + if (cp->type == CLEAR_REFS_SOFT_DIRTY) { + clear_soft_dirty_pmd(vma, addr, pmd); + goto out; + } + + page = pmd_page(*pmd); + + /* Clear accessed and referenced bits. */ + pmdp_test_and_clear_young(vma, addr, pmd); + ClearPageReferenced(page); +out: + spin_unlock(ptl); + return 0; + } + if (pmd_trans_unstable(pmd)) return 0; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932241AbaGARIE (ORCPT ); Tue, 1 Jul 2014 13:08:04 -0400 Received: from mx1.redhat.com ([209.132.183.28]:61068 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758643AbaGARIB (ORCPT ); Tue, 1 Jul 2014 13:08:01 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. 
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 07/13] numa_maps: fix typo in gather_hugetbl_stats Date: Tue, 1 Jul 2014 13:07:25 -0400 Message-Id: <1404234451-21695-8-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code grep-friendly. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index 5ebc238d1a38..0d3d1ac32b2e 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -1347,7 +1347,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, return 0; } #ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, +static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct numa_maps *md; @@ -1366,7 +1366,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, } #else -static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask, +static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; @@ -1398,7 +1398,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) md->vma = vma; - walk.hugetlb_entry = gather_hugetbl_stats; + walk.hugetlb_entry = gather_hugetlb_stats; walk.pmd_entry = gather_pte_stats; walk.private = md; walk.mm = mm; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758756AbaGARKU (ORCPT ); Tue, 1 Jul 2014 
13:10:20 -0400
Received: from mx1.redhat.com ([209.132.183.28]:9767 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758646AbaGARID (ORCPT ); Tue, 1 Jul 2014 13:08:03 -0400
From: Naoya Horiguchi
To: linux-mm@kvack.org
Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi
Subject: [PATCH v4 11/13] mempolicy: apply page table walker on queue_pages_range()
Date: Tue, 1 Jul 2014 13:07:29 -0400
Message-Id: <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com>
In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com>
References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

queue_pages_range() currently walks the page tables in its own way, which duplicates code. This patch applies the page table walker to reduce the number of lines.

queue_pages_range() has to do some prechecks to determine whether we really walk over the vma or just skip it. Now we have the test_walk() callback in mm_walk for this purpose, so we can do this replacement cleanly. queue_pages_test_walk() depends not only on the current vma but also on the previous one, so queue_pages->prev is introduced to remember it.

ChangeLog v4:
- rebase to v3.16-rc3, where the return value of queue_pages_range() becomes 0 on success instead of the first found vma, and use -EFAULT instead of ERR_PTR() on failure.
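To make the test_walk() idea concrete, here is a small user-space sketch of a per-vma filter that remembers the previous vma. It is an illustration, not the patch's queue_pages_test_walk(): every toy_* name and the skip flag are hypothetical. The return convention (0 walks the vma, a positive value skips it) follows clear_refs_test_walk() in this series; treating a negative value as "abort the whole walk" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
	unsigned long vm_start, vm_end;
	int skip;			/* hypothetical per-vma filter flag */
	struct toy_vma *vm_next;
};

struct toy_walk {
	int (*test_walk)(struct toy_vma *vma, struct toy_walk *walk);
	int (*vma_entry)(struct toy_vma *vma, struct toy_walk *walk);
	void *private;
};

/* Caller state, loosely modelled on struct queue_pages in the patch. */
struct toy_queue {
	struct toy_vma *prev;		/* last vma seen by test_walk */
	int walked;			/* vmas that were actually walked */
};

static int toy_walk_vmas(struct toy_vma *vma, struct toy_walk *walk)
{
	assert(walk->test_walk != NULL && walk->vma_entry != NULL);

	for (; vma; vma = vma->vm_next) {
		int ret = walk->test_walk(vma, walk);

		if (ret < 0)
			return ret;	/* abort the whole walk */
		if (ret > 0)
			continue;	/* skip this vma only */
		ret = walk->vma_entry(vma, walk);
		if (ret)
			return ret;
	}
	return 0;
}

/* Fail on a hole between neighbouring vmas, otherwise apply the filter. */
static int toy_test_walk(struct toy_vma *vma, struct toy_walk *walk)
{
	struct toy_queue *q = walk->private;
	struct toy_vma *prev = q->prev;

	q->prev = vma;			/* remember for the next call */
	if (prev && prev->vm_end < vma->vm_start)
		return -1;
	return vma->skip ? 1 : 0;
}

static int toy_count(struct toy_vma *vma, struct toy_walk *walk)
{
	struct toy_queue *q = walk->private;

	(void)vma;
	q->walked++;
	return 0;
}
```

Keeping prev in the caller's private structure lets the filter look at the relationship between two neighbouring vmas without the walker itself having to know about it.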
Signed-off-by: Naoya Horiguchi --- mm/mempolicy.c | 224 +++++++++++++++++++++++---------------------------------- 1 file changed, 90 insertions(+), 134 deletions(-) diff --git v3.16-rc3.orig/mm/mempolicy.c v3.16-rc3/mm/mempolicy.c index eb58de19f815..47a6fa913b83 100644 --- v3.16-rc3.orig/mm/mempolicy.c +++ v3.16-rc3/mm/mempolicy.c @@ -479,24 +479,34 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { static void migrate_page_add(struct page *page, struct list_head *pagelist, unsigned long flags); +struct queue_pages { + struct list_head *pagelist; + unsigned long flags; + nodemask_t *nmask; + struct vm_area_struct *prev; +}; + /* * Scan through pages checking if pages follow certain conditions, * and move them to the pagelist if they do. */ -static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) +static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) { - pte_t *orig_pte; + struct vm_area_struct *vma = walk->vma; + struct page *page; + struct queue_pages *qp = walk->private; + unsigned long flags = qp->flags; + int nid; pte_t *pte; spinlock_t *ptl; - orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - do { - struct page *page; - int nid; + split_huge_page_pmd(vma, addr, pmd); + if (pmd_trans_unstable(pmd)) + return 0; + pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + for (; addr != end; pte++, addr += PAGE_SIZE) { if (!pte_present(*pte)) continue; page = vm_normal_page(vma, addr, *pte); @@ -509,114 +519,46 @@ static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (PageReserved(page)) continue; nid = page_to_nid(page); - if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) + if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT)) continue; if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) - migrate_page_add(page, private, 
flags); - else - break; - } while (pte++, addr += PAGE_SIZE, addr != end); - pte_unmap_unlock(orig_pte, ptl); - return addr != end; + migrate_page_add(page, qp->pagelist, flags); + } + pte_unmap_unlock(pte - 1, ptl); + cond_resched(); + return 0; } -static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma, - pmd_t *pmd, const nodemask_t *nodes, unsigned long flags, - void *private) +static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask, + unsigned long addr, unsigned long end, + struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE + struct queue_pages *qp = walk->private; + unsigned long flags = qp->flags; int nid; struct page *page; spinlock_t *ptl; pte_t entry; - ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd); - entry = huge_ptep_get((pte_t *)pmd); + ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte); + entry = huge_ptep_get(pte); if (!pte_present(entry)) goto unlock; page = pte_page(entry); nid = page_to_nid(page); - if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT)) + if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT)) goto unlock; /* With MPOL_MF_MOVE, we migrate only unshared hugepage. 
*/ if (flags & (MPOL_MF_MOVE_ALL) || (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) - isolate_huge_page(page, private); + isolate_huge_page(page, qp->pagelist); unlock: spin_unlock(ptl); #else BUG(); #endif -} - -static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) -{ - pmd_t *pmd; - unsigned long next; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (!pmd_present(*pmd)) - continue; - if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) { - queue_pages_hugetlb_pmd_range(vma, pmd, nodes, - flags, private); - continue; - } - split_huge_page_pmd(vma, addr, pmd); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - continue; - if (queue_pages_pte_range(vma, pmd, addr, next, nodes, - flags, private)) - return -EIO; - } while (pmd++, addr = next, addr != end); - return 0; -} - -static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) -{ - pud_t *pud; - unsigned long next; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) - continue; - if (pud_none_or_clear_bad(pud)) - continue; - if (queue_pages_pmd_range(vma, pud, addr, next, nodes, - flags, private)) - return -EIO; - } while (pud++, addr = next, addr != end); - return 0; -} - -static inline int queue_pages_pgd_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - const nodemask_t *nodes, unsigned long flags, - void *private) -{ - pgd_t *pgd; - unsigned long next; - - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - continue; - if (queue_pages_pud_range(vma, pgd, addr, next, nodes, - flags, private)) - return -EIO; - } while (pgd++, addr = next, addr != end); return 0; } @@ 
-649,6 +591,44 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma, } #endif /* CONFIG_NUMA_BALANCING */ +static int queue_pages_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + struct queue_pages *qp = walk->private; + unsigned long endvma = vma->vm_end; + unsigned long flags = qp->flags; + + if (endvma > end) + endvma = end; + if (vma->vm_start > start) + start = vma->vm_start; + + if (!(flags & MPOL_MF_DISCONTIG_OK)) { + if (!vma->vm_next && vma->vm_end < end) + return -EFAULT; + if (qp->prev && qp->prev->vm_end < vma->vm_start) + return -EFAULT; + } + + qp->prev = vma; + + if (vma->vm_flags & VM_PFNMAP) + return 1; + + if (flags & MPOL_MF_LAZY) { + change_prot_numa(vma, start, endvma); + return 1; + } + + if ((flags & MPOL_MF_STRICT) || + ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && + vma_migratable(vma))) + /* queue pages from current vma */ + return 0; + return 1; +} + /* * Walk through page tables and collect pages to be migrated. 
* @@ -658,48 +638,24 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma, */ static int queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, - const nodemask_t *nodes, unsigned long flags, void *private) -{ - int err = 0; - struct vm_area_struct *vma, *prev; - - vma = find_vma(mm, start); - if (!vma) - return -EFAULT; - prev = NULL; - for (; vma && vma->vm_start < end; vma = vma->vm_next) { - unsigned long endvma = vma->vm_end; - - if (endvma > end) - endvma = end; - if (vma->vm_start > start) - start = vma->vm_start; - - if (!(flags & MPOL_MF_DISCONTIG_OK)) { - if (!vma->vm_next && vma->vm_end < end) - return -EFAULT; - if (prev && prev->vm_end < vma->vm_start) - return -EFAULT; - } - - if (flags & MPOL_MF_LAZY) { - change_prot_numa(vma, start, endvma); - goto next; - } - - if ((flags & MPOL_MF_STRICT) || - ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && - vma_migratable(vma))) { - - err = queue_pages_pgd_range(vma, start, endvma, nodes, - flags, private); - if (err) - break; - } -next: - prev = vma; - } - return err; + nodemask_t *nodes, unsigned long flags, + struct list_head *pagelist) +{ + struct queue_pages qp = { + .pagelist = pagelist, + .flags = flags, + .nmask = nodes, + .prev = NULL, + }; + struct mm_walk queue_pages_walk = { + .hugetlb_entry = queue_pages_hugetlb, + .pmd_entry = queue_pages_pte_range, + .test_walk = queue_pages_test_walk, + .mm = mm, + .private = &qp, + }; + + return walk_page_range(start, end, &queue_pages_walk); } /* -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758765AbaGARKo (ORCPT ); Tue, 1 Jul 2014 13:10:44 -0400 Received: from mx1.redhat.com ([209.132.183.28]:61566 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758637AbaGARIA (ORCPT ); Tue, 1 Jul 2014 13:08:00 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins ,
"Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 02/13] pagewalk: improve vma handling Date: Tue, 1 Jul 2014 13:07:20 -0400 Message-Id: <1404234451-21695-3-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The current implementation of the page table walker has a fundamental problem in vma handling, which started when we tried to handle vma(VM_HUGETLB). Because the vma handling is done in the pgd loop, considering vma boundaries makes the code complicated and bug-prone. From the users' viewpoint, some users check vma-related conditions to decide whether to actually walk over the vma. To solve these problems, this patch moves the vma check outside the pgd loop and introduces a new callback ->test_walk(). ChangeLog v4: - avoid walking over the regions where vma is NULL if pte_hole() is undefined - use vma->vm_next instead of repeating find_vma() - use min() in walk_page_range "outside vma" branch - fix return value of walk_hugetlb_range() ChangeLog v3: - drop walk->skip control Signed-off-by: Naoya Horiguchi Acked-by: Kirill A.
Shutemov --- include/linux/mm.h | 15 +++- mm/pagewalk.c | 203 ++++++++++++++++++++++++++++++----------------------- 2 files changed, 129 insertions(+), 89 deletions(-) diff --git v3.16-rc3.orig/include/linux/mm.h v3.16-rc3/include/linux/mm.h index c5cb6394e6cb..489a63a06a4a 100644 --- v3.16-rc3.orig/include/linux/mm.h +++ v3.16-rc3/include/linux/mm.h @@ -1107,10 +1107,16 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @pte_entry: if set, called for each non-empty PTE (4th-level) entry * @pte_hole: if set, called for each hole at all levels * @hugetlb_entry: if set, called for each hugetlb entry - * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry - * is used. + * @test_walk: caller specific callback function to determine whether + * we walk over the current vma or not. Returning 0 means + * "do page table walk over the current vma," a negative + * value means "abort current page table walk right now," + * and a positive value means "skip the current vma."
+ * @mm: mm_struct representing the target process of page table walk + * @vma: vma currently walked (NULL if walking outside vmas) + * @private: private data for callbacks' usage * - * (see walk_page_range for more details) + * (see the comment on walk_page_range() for more details) */ struct mm_walk { int (*pmd_entry)(pmd_t *pmd, unsigned long addr, @@ -1122,7 +1128,10 @@ struct mm_walk { int (*hugetlb_entry)(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long next, struct mm_walk *walk); + int (*test_walk)(unsigned long addr, unsigned long next, + struct mm_walk *walk); struct mm_struct *mm; + struct vm_area_struct *vma; void *private; }; diff --git v3.16-rc3.orig/mm/pagewalk.c v3.16-rc3/mm/pagewalk.c index 335690650b12..91810ba875ea 100644 --- v3.16-rc3.orig/mm/pagewalk.c +++ v3.16-rc3/mm/pagewalk.c @@ -59,7 +59,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, continue; split_huge_page_pmd_mm(walk->mm, addr, pmd); - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) + if (pmd_trans_unstable(pmd)) goto again; err = walk_pte_range(pmd, addr, next, walk); if (err) @@ -95,6 +95,32 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end, return err; } +static int walk_pgd_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + pgd_t *pgd; + unsigned long next; + int err = 0; + + pgd = pgd_offset(walk->mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) { + if (walk->pte_hole) + err = walk->pte_hole(addr, next, walk); + if (err) + break; + continue; + } + if (walk->pmd_entry || walk->pte_entry) + err = walk_pud_range(pgd, addr, next, walk); + if (err) + break; + } while (pgd++, addr = next, addr != end); + + return err; +} + #ifdef CONFIG_HUGETLB_PAGE static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, unsigned long end) @@ -103,10 +129,10 @@ static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr, return 
boundary < end ? boundary : end; } -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; struct hstate *h = hstate_vma(vma); unsigned long next; unsigned long hmask = huge_page_mask(h); @@ -119,15 +145,14 @@ static int walk_hugetlb_range(struct vm_area_struct *vma, if (pte && walk->hugetlb_entry) err = walk->hugetlb_entry(pte, hmask, addr, next, walk); if (err) - return err; + break; } while (addr = next, addr != end); - return 0; + return err; } #else /* CONFIG_HUGETLB_PAGE */ -static int walk_hugetlb_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, +static int walk_hugetlb_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; @@ -135,109 +160,115 @@ static int walk_hugetlb_range(struct vm_area_struct *vma, #endif /* CONFIG_HUGETLB_PAGE */ +/* + * Decide whether we really walk over the current vma on [@start, @end) + * or skip it via the returned value. Return 0 if we do walk over the + * current vma, and return 1 if we skip the vma. A negative value means + * an error, where we abort the current walk. + * + * Default check (only VM_PFNMAP check for now) is used when the caller + * doesn't define a test_walk() callback. + */ +static int walk_page_test(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + if (walk->test_walk) + return walk->test_walk(start, end, walk); + + /* + * Do not walk over vma(VM_PFNMAP), because we have no valid struct + * page backing a VM_PFNMAP range. See also commit a9ff785e4437.
+ */ + if (vma->vm_flags & VM_PFNMAP) + return 1; + return 0; +} + +static int __walk_page_range(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + int err = 0; + struct vm_area_struct *vma = walk->vma; + + if (vma && is_vm_hugetlb_page(vma)) { + if (walk->hugetlb_entry) + err = walk_hugetlb_range(start, end, walk); + } else + err = walk_pgd_range(start, end, walk); + + return err; +} /** - * walk_page_range - walk a memory map's page tables with a callback - * @addr: starting address - * @end: ending address - * @walk: set of callbacks to invoke for each level of the tree - * - * Recursively walk the page table for the memory area in a VMA, - * calling supplied callbacks. Callbacks are called in-order (first - * PGD, first PUD, first PMD, first PTE, second PTE... second PMD, - * etc.). If lower-level callbacks are omitted, walking depth is reduced. + * walk_page_range - walk page table with caller specific callbacks * - * Each callback receives an entry pointer and the start and end of the - * associated range, and a copy of the original mm_walk for access to - * the ->private or ->mm fields. + * Recursively walk the page table tree of the process represented by @walk->mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. + * The return values of these callbacks are commonly defined like below: + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. * - * Usually no locks are taken, but splitting transparent huge page may - * take page table lock. 
And the bottom level iterator will map PTE - * directories from highmem if necessary. + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @walk->test_walk() are used for this + * purpose. * - * If any callback returns a non-zero value, the walk is aborted and - * the return value is propagated back to the caller. Otherwise 0 is returned. + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @walk->private should be helpful. * - * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry - * is !NULL. + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold + * @walk->mm->mmap_sem, because these function traverse vma list and/or + * access to vma's data. */ -int walk_page_range(unsigned long addr, unsigned long end, +int walk_page_range(unsigned long start, unsigned long end, struct mm_walk *walk) { - pgd_t *pgd; - unsigned long next; int err = 0; + unsigned long next; + struct vm_area_struct *vma; - if (addr >= end) - return err; + if (start >= end) + return -EINVAL; if (!walk->mm) return -EINVAL; VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem)); - pgd = pgd_offset(walk->mm, addr); + vma = find_vma(walk->mm, start); do { - struct vm_area_struct *vma = NULL; + if (!vma) { /* after the last vma */ + walk->vma = NULL; + next = end; + } else if (start < vma->vm_start) { /* outside vma */ + walk->vma = NULL; + next = min(end, vma->vm_start); + } else { /* inside vma */ + walk->vma = vma; + next = min(end, vma->vm_end); + vma = vma->vm_next; - next = pgd_addr_end(addr, end); - - /* - * This function was not intended to be vma based. 
- * But there are vma special cases to be handled: - * - hugetlb vma's - * - VM_PFNMAP vma's - */ - vma = find_vma(walk->mm, addr); - if (vma) { - /* - * There are no page structures backing a VM_PFNMAP - * range, so do not allow split_huge_page_pmd(). - */ - if ((vma->vm_start <= addr) && - (vma->vm_flags & VM_PFNMAP)) { - next = vma->vm_end; - pgd = pgd_offset(walk->mm, next); + err = walk_page_test(start, next, walk); + if (err > 0) continue; - } - /* - * Handle hugetlb vma individually because pagetable - * walk for the hugetlb page is dependent on the - * architecture and we can't handled it in the same - * manner as non-huge pages. - */ - if (walk->hugetlb_entry && (vma->vm_start <= addr) && - is_vm_hugetlb_page(vma)) { - if (vma->vm_end < next) - next = vma->vm_end; - /* - * Hugepage is very tightly coupled with vma, - * so walk through hugetlb entries within a - * given vma. - */ - err = walk_hugetlb_range(vma, addr, next, walk); - if (err) - break; - pgd = pgd_offset(walk->mm, next); - continue; - } - } - - if (pgd_none_or_clear_bad(pgd)) { - if (walk->pte_hole) - err = walk->pte_hole(addr, next, walk); - if (err) + if (err < 0) break; - pgd++; - continue; } - if (walk->pmd_entry || walk->pte_entry) - err = walk_pud_range(pgd, addr, next, walk); + if (walk->vma || walk->pte_hole) + err = __walk_page_range(start, next, walk); if (err) break; - pgd++; - } while (addr = next, addr < end); - + } while (start = next, start < end); return err; } -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758787AbaGARLH (ORCPT ); Tue, 1 Jul 2014 13:11:07 -0400 Received: from mx1.redhat.com ([209.132.183.28]:28109 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758630AbaGARIA (ORCPT ); Tue, 1 Jul 2014 13:08:00 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. 
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 04/13] smaps: remove mem_size_stats->vma and use walk_page_vma() Date: Tue, 1 Jul 2014 13:07:22 -0400 Message-Id: <1404234451-21695-5-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable. ChangeLog v4: - remove redundant vma Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index cfa63ee92c96..3067bf08393b 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -430,7 +430,6 @@ const struct file_operations proc_tid_maps_operations = { #ifdef CONFIG_PROC_PAGE_MONITOR struct mem_size_stats { - struct vm_area_struct *vma; unsigned long resident; unsigned long shared_clean; unsigned long shared_dirty; @@ -449,7 +448,7 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr, unsigned long ptent_size, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - struct vm_area_struct *vma = mss->vma; + struct vm_area_struct *vma = walk->vma; pgoff_t pgoff = linear_page_index(vma, addr); struct page *page = NULL; int mapcount; @@ -501,7 +500,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - struct vm_area_struct *vma = mss->vma; + struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; @@ -594,10 +593,8 @@ static int show_smap(struct seq_file *m, void *v, int is_pid) 
}; memset(&mss, 0, sizeof mss); - mss.vma = vma; /* mmap_sem is held in m_start */ - if (vma->vm_mm && !is_vm_hugetlb_page(vma)) - walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk); + walk_page_vma(vma, &smaps_walk); show_map_vma(m, vma, is_pid); -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964841AbaGARLd (ORCPT ); Tue, 1 Jul 2014 13:11:33 -0400 Received: from mx1.redhat.com ([209.132.183.28]:37369 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758585AbaGARH7 (ORCPT ); Tue, 1 Jul 2014 13:07:59 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 09/13] memcg: cleanup preparation for page table walk Date: Tue, 1 Jul 2014 13:07:27 -0400 Message-Id: <1404234451-21695-10-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org pagewalk.c can handle the vma by itself, so we don't have to pass the vma via walk->private. Both mem_cgroup_count_precharge() and mem_cgroup_move_charge() loop over each vma themselves, but that is now done in pagewalk.c, so let's clean them up. ChangeLog v4: - use walk_page_range() instead of walk_page_vma() with a for loop.
Signed-off-by: Naoya Horiguchi --- mm/memcontrol.c | 49 ++++++++++++++++--------------------------------- 1 file changed, 16 insertions(+), 33 deletions(-) diff --git v3.16-rc3.orig/mm/memcontrol.c v3.16-rc3/mm/memcontrol.c index a2c7bcb0e6eb..6c075113c363 100644 --- v3.16-rc3.orig/mm/memcontrol.c +++ v3.16-rc3/mm/memcontrol.c @@ -6654,7 +6654,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { - struct vm_area_struct *vma = walk->private; + struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; @@ -6680,20 +6680,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd, static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm) { unsigned long precharge; - struct vm_area_struct *vma; + struct mm_walk mem_cgroup_count_precharge_walk = { + .pmd_entry = mem_cgroup_count_precharge_pte_range, + .mm = mm, + }; down_read(&mm->mmap_sem); - for (vma = mm->mmap; vma; vma = vma->vm_next) { - struct mm_walk mem_cgroup_count_precharge_walk = { - .pmd_entry = mem_cgroup_count_precharge_pte_range, - .mm = mm, - .private = vma, - }; - if (is_vm_hugetlb_page(vma)) - continue; - walk_page_range(vma->vm_start, vma->vm_end, - &mem_cgroup_count_precharge_walk); - } + walk_page_range(0, ~0UL, &mem_cgroup_count_precharge_walk); up_read(&mm->mmap_sem); precharge = mc.precharge; @@ -6832,7 +6825,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, struct mm_walk *walk) { int ret = 0; - struct vm_area_struct *vma = walk->private; + struct vm_area_struct *vma = walk->vma; pte_t *pte; spinlock_t *ptl; enum mc_target_type target_type; @@ -6932,7 +6925,10 @@ put: /* get_mctgt_type() gets the page */ static void mem_cgroup_move_charge(struct mm_struct *mm) { - struct vm_area_struct *vma; + struct mm_walk mem_cgroup_move_charge_walk = { + .pmd_entry = mem_cgroup_move_charge_pte_range, + .mm = mm, + }; lru_add_drain_all(); retry: @@ -6948,24 +6944,11 @@ static void 
mem_cgroup_move_charge(struct mm_struct *mm) cond_resched(); goto retry; } - for (vma = mm->mmap; vma; vma = vma->vm_next) { - int ret; - struct mm_walk mem_cgroup_move_charge_walk = { - .pmd_entry = mem_cgroup_move_charge_pte_range, - .mm = mm, - .private = vma, - }; - if (is_vm_hugetlb_page(vma)) - continue; - ret = walk_page_range(vma->vm_start, vma->vm_end, - &mem_cgroup_move_charge_walk); - if (ret) - /* - * means we have consumed all precharges and failed in - * doing additional charge. Just abandon here. - */ - break; - } + /* + * When we have consumed all precharges and failed in doing + * additional charge, the page walk just aborts. + */ + walk_page_range(0, ~0UL, &mem_cgroup_move_charge_walk); up_read(&mm->mmap_sem); } -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758612AbaGARH7 (ORCPT ); Tue, 1 Jul 2014 13:07:59 -0400 Received: from mx1.redhat.com ([209.132.183.28]:62988 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758585AbaGARHz (ORCPT ); Tue, 1 Jul 2014 13:07:55 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 06/13] pagemap: use walk->vma instead of calling find_vma() Date: Tue, 1 Jul 2014 13:07:24 -0400 Message-Id: <1404234451-21695-7-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Page table walker has the information of the current vma in mm_walk, so we don't have to call find_vma() in each pagemap_hugetlb_range() call. NULL-vma check is omitted because we assume that we never run hugetlb_entry() callback on the address without vma. 
And even if it were broken, null pointer dereference would be detected, so we can get enough information for debugging. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- fs/proc/task_mmu.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git v3.16-rc3.orig/fs/proc/task_mmu.c v3.16-rc3/fs/proc/task_mmu.c index df9f368e01b7..5ebc238d1a38 100644 --- v3.16-rc3.orig/fs/proc/task_mmu.c +++ v3.16-rc3/fs/proc/task_mmu.c @@ -1080,15 +1080,12 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, struct mm_walk *walk) { struct pagemapread *pm = walk->private; - struct vm_area_struct *vma; + struct vm_area_struct *vma = walk->vma; int err = 0; int flags2; pagemap_entry_t pme; - vma = find_vma(walk->mm, addr); - WARN_ON_ONCE(!vma); - - if (vma && (vma->vm_flags & VM_SOFTDIRTY)) + if (vma->vm_flags & VM_SOFTDIRTY) flags2 = __PM_SOFT_DIRTY; else flags2 = 0; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964870AbaGARMA (ORCPT ); Tue, 1 Jul 2014 13:12:00 -0400 Received: from mx1.redhat.com ([209.132.183.28]:41007 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758569AbaGARHw (ORCPT ); Tue, 1 Jul 2014 13:07:52 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 03/13] pagewalk: add walk_page_vma() Date: Tue, 1 Jul 2014 13:07:21 -0400 Message-Id: <1404234451-21695-4-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Introduces walk_page_vma(), which is useful for the callers which want to walk over a given vma. 
It's used by later patches. ChangeLog v3: - check walk_page_test's return value instead of walk->skip Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- include/linux/mm.h | 1 + mm/pagewalk.c | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+) diff --git v3.16-rc3.orig/include/linux/mm.h v3.16-rc3/include/linux/mm.h index 489a63a06a4a..7e9287750866 100644 --- v3.16-rc3.orig/include/linux/mm.h +++ v3.16-rc3/include/linux/mm.h @@ -1137,6 +1137,7 @@ struct mm_walk { int walk_page_range(unsigned long addr, unsigned long end, struct mm_walk *walk); +int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk); void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, diff --git v3.16-rc3.orig/mm/pagewalk.c v3.16-rc3/mm/pagewalk.c index 91810ba875ea..65fb68df3aa2 100644 --- v3.16-rc3.orig/mm/pagewalk.c +++ v3.16-rc3/mm/pagewalk.c @@ -272,3 +272,21 @@ int walk_page_range(unsigned long start, unsigned long end, } while (start = next, start < end); return err; } + +int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk) +{ + int err; + + if (!walk->mm) + return -EINVAL; + + VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem)); + VM_BUG_ON(!vma); + walk->vma = vma; + err = walk_page_test(vma->vm_start, vma->vm_end, walk); + if (err > 0) + return 0; + if (err < 0) + return err; + return __walk_page_range(vma->vm_start, vma->vm_end, walk); +} -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758507AbaGARMU (ORCPT ); Tue, 1 Jul 2014 13:12:20 -0400 Received: from mx1.redhat.com ([209.132.183.28]:8026 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758565AbaGARHv (ORCPT ); Tue, 1 Jul 2014 13:07:51 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Dave Hansen , Hugh Dickins 
, "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: [PATCH v4 01/13] mm/pagewalk: remove pgd_entry() and pud_entry() Date: Tue, 1 Jul 2014 13:07:19 -0400 Message-Id: <1404234451-21695-2-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently no user of the page table walker sets ->pgd_entry() or ->pud_entry(), so checking for their existence in each loop iteration just wastes CPU cycles. Let's remove them to reduce overhead. Signed-off-by: Naoya Horiguchi Acked-by: Kirill A. Shutemov --- include/linux/mm.h | 6 ------ mm/pagewalk.c | 9 ++------- 2 files changed, 2 insertions(+), 13 deletions(-) diff --git v3.16-rc3.orig/include/linux/mm.h v3.16-rc3/include/linux/mm.h index e03dd29145a0..c5cb6394e6cb 100644 --- v3.16-rc3.orig/include/linux/mm.h +++ v3.16-rc3/include/linux/mm.h @@ -1100,8 +1100,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, /** * mm_walk - callbacks for walk_page_range - * @pgd_entry: if set, called for each non-empty PGD (top-level) entry - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry * this handler is required to be able to handle * pmd_trans_huge() pmds.
They may simply choose to @@ -1115,10 +1113,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * (see walk_page_range for more details) */ struct mm_walk { - int (*pgd_entry)(pgd_t *pgd, unsigned long addr, - unsigned long next, struct mm_walk *walk); - int (*pud_entry)(pud_t *pud, unsigned long addr, - unsigned long next, struct mm_walk *walk); int (*pmd_entry)(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_entry)(pte_t *pte, unsigned long addr, diff --git v3.16-rc3.orig/mm/pagewalk.c v3.16-rc3/mm/pagewalk.c index 2beeabf502c5..335690650b12 100644 --- v3.16-rc3.orig/mm/pagewalk.c +++ v3.16-rc3/mm/pagewalk.c @@ -86,9 +86,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end, break; continue; } - if (walk->pud_entry) - err = walk->pud_entry(pud, addr, next, walk); - if (!err && (walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pmd_range(pud, addr, next, walk); if (err) break; @@ -234,10 +232,7 @@ int walk_page_range(unsigned long addr, unsigned long end, pgd++; continue; } - if (walk->pgd_entry) - err = walk->pgd_entry(pgd, addr, next, walk); - if (!err && - (walk->pud_entry || walk->pmd_entry || walk->pte_entry)) + if (walk->pmd_entry || walk->pte_entry) err = walk_pud_range(pgd, addr, next, walk); if (err) break; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754351AbaGAVAk (ORCPT ); Tue, 1 Jul 2014 17:00:40 -0400 Received: from mga02.intel.com ([134.134.136.20]:7357 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751744AbaGAVAj (ORCPT ); Tue, 1 Jul 2014 17:00:39 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.01,584,1400050800"; d="scan'208";a="537386066" Message-ID: <53B32170.1040707@intel.com> Date: Tue, 01 Jul 2014 14:00:32 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) 
Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: Naoya Horiguchi , linux-mm@kvack.org CC: Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi , Jet Chen Subject: Re: [PATCH v4 11/13] mempolicy: apply page table walker on queue_pages_range() References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/01/2014 10:07 AM, Naoya Horiguchi wrote: > queue_pages_range() does page table walking in its own way now, but there > is some code duplicate. This patch applies page table walker to reduce > lines of code. > > queue_pages_range() has to do some precheck to determine whether we really > walk over the vma or just skip it. Now we have test_walk() callback in > mm_walk for this purpose, so we can do this replacement cleanly. > queue_pages_test_walk() depends on not only the current vma but also the > previous one, so queue_pages->prev is introduced to remember it. Hi Naoya, The previous version of this patch caused a performance regression which was reported to you: http://marc.info/?l=linux-kernel&m=140375975525069&w=2 Has that been dealt with in this version somehow? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964952AbaGAVwT (ORCPT ); Tue, 1 Jul 2014 17:52:19 -0400 Received: from mx1.redhat.com ([209.132.183.28]:35448 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756968AbaGAVwQ (ORCPT ); Tue, 1 Jul 2014 17:52:16 -0400 Date: Tue, 1 Jul 2014 17:51:56 -0400 From: Naoya Horiguchi To: Dave Hansen Cc: linux-mm@kvack.org, Andrew Morton , Hugh Dickins , "Kirill A. 
Shutemov" , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi , Jet Chen Subject: Re: [PATCH v4 11/13] mempolicy: apply page table walker on queue_pages_range() Message-ID: <20140701215156.GA21032@nhori.bos.redhat.com> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-12-git-send-email-n-horiguchi@ah.jp.nec.com> <53B32170.1040707@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53B32170.1040707@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 01, 2014 at 02:00:32PM -0700, Dave Hansen wrote: > On 07/01/2014 10:07 AM, Naoya Horiguchi wrote: > > queue_pages_range() does page table walking in its own way now, but there > > is some code duplicate. This patch applies page table walker to reduce > > lines of code. > > > > queue_pages_range() has to do some precheck to determine whether we really > > walk over the vma or just skip it. Now we have test_walk() callback in > > mm_walk for this purpose, so we can do this replacement cleanly. > > queue_pages_test_walk() depends on not only the current vma but also the > > previous one, so queue_pages->prev is introduced to remember it. > > Hi Naoya, > > The previous version of this patch caused a performance regression which > was reported to you: > > http://marc.info/?l=linux-kernel&m=140375975525069&w=2 > > Has that been dealt with in this version somehow? I believe so. In the previous version we called the ->pte_entry() callback for each pte entry, but in this version I stopped doing that and most of the work is done in the ->pmd_entry() callback, so the number of function calls is reduced to about 1/512 of what it was. Other than that, I just cleaned up queue_pages_* without major behavioral changes, so the visible regression should be solved.
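To make the 1/512 figure concrete, here is a minimal userspace sketch (not part of the patch) that counts how many handler invocations a walk would make per PTE versus per PMD. It assumes x86_64 geometry with 4 KiB pages and 512 PTEs per PMD, a page-aligned non-empty range, and fully populated page tables; the SKETCH_* macros and both helper names are hypothetical.

```c
/* Assumed x86_64 geometry: 4 KiB pages, 2 MiB covered by one PMD entry. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SHIFT  21

/* Handler invocations if the walker calls back once per PTE in [start, end). */
static unsigned long calls_per_pte(unsigned long start, unsigned long end)
{
	return (end - start) >> SKETCH_PAGE_SHIFT;
}

/* Handler invocations if the walker calls back once per PMD touched by [start, end). */
static unsigned long calls_per_pmd(unsigned long start, unsigned long end)
{
	unsigned long first = start >> SKETCH_PMD_SHIFT;
	unsigned long last = (end - 1) >> SKETCH_PMD_SHIFT;

	return last - first + 1;
}
```

For a 1 GiB range starting at address 0, calls_per_pte() gives 262144 and calls_per_pmd() gives 512, a ratio of 512. The real walker skips empty entries, so actual counts depend on how the range is populated.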
Thanks, Naoya Horiguchi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755626AbaGINfA (ORCPT ); Wed, 9 Jul 2014 09:35:00 -0400 Received: from mta-out1.inet.fi ([62.71.2.198]:60593 "EHLO jenni2.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752020AbaGINe7 (ORCPT ); Wed, 9 Jul 2014 09:34:59 -0400 Date: Wed, 9 Jul 2014 16:34:36 +0300 From: "Kirill A. Shutemov" To: Naoya Horiguchi Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140709133436.GA18391@node.dhcp.inet.fi> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> User-Agent: Mutt/1.5.22.1 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > of code by using common page table walk code. 
> > ChangeLog v4: > - remove redundant vma > > ChangeLog v3: > - add NULL vma check in mincore_unmapped_range() > - don't use pte_entry() > > ChangeLog v2: > - change type of args of callbacks to void * > - move definition of mincore_walk to the start of the function to fix compiler > warning > > Signed-off-by: Naoya Horiguchi Trinity crases this implementation of mincore pretty easily: [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 [ 42.777560] PGD 2ef6067 PUD 87fa01067 PMD 87f823067 PTE 800000007bb61060 [ 42.778529] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [ 42.779106] Modules linked in: [ 42.779106] CPU: 0 PID: 917 Comm: trinity-c27 Not tainted 3.16.0-rc4-next-20140709-00013-g28e4629f71a8 #1450 [ 42.779106] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 42.779106] task: ffff880852e98110 ti: ffff880844024000 task.ti: ffff880844024000 [ 42.779106] RIP: 0010:[] [] mincore_unmapped_range+0xdf/0x100 [ 42.779106] RSP: 0018:ffff880844027df0 EFLAGS: 00010202 [ 42.779106] RAX: 000000000000001c RBX: 00007fc300000000 RCX: 00003ffffffff000 [ 42.779106] RDX: 000000000000001b RSI: ffff88007bb60fe5 RDI: 00007fc2c2c00000 [ 42.779106] RBP: ffff880844027e28 R08: 00007fc2c2e00000 R09: 0000000000000000 [ 42.779106] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000200 [ 42.779106] R13: ffff88007bb60fe5 R14: ffff880855a80018 R15: 00007fc2c2c00000 [ 42.779106] FS: 00007fc345666700(0000) GS:ffff880859600000(0000) knlGS:0000000000000000 [ 42.779106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 42.779106] CR2: ffff88007bb61000 CR3: 0000000852dfd000 CR4: 00000000000006f0 [ 42.779106] Stack: [ 42.779106] ffff880844027f10 ffff88007bb60fe5 00007fc300000000 00007fc2c2e00000 [ 42.779106] 00007fc2c1e1b000 ffff880844027f10 00007fc2c2c00000 ffff880844027eb8 [ 42.779106] ffffffff81135bfe 00007fc341c1bfff ffff880000000000 ffff880852dfd7f8 [ 42.779106] Call Trace: [ 42.779106] [] 
__walk_page_range+0x1ae/0x450 [ 42.779106] [] walk_page_vma+0x71/0x90 [ 42.779106] [] SyS_mincore+0x1de/0x270 [ 42.779106] [] ? trace_hardirqs_on+0xd/0x10 [ 42.779106] [] ? mincore_unmapped_range+0x100/0x100 [ 42.779106] [] ? mincore_page+0xa0/0xa0 [ 42.779106] [] ? handle_mm_fault+0xd30/0xd30 [ 42.779106] [] system_call_fastpath+0x16/0x1b [ 42.779106] Code: 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 40 00 31 d2 31 c0 4d 85 e4 4c 8b 6d d0 74 d3 0f 1f 00 48 8b 75 d0 83 c0 01 04 16 00 48 63 d0 49 39 d4 77 ed eb b3 48 89 fe 4c 89 f7 e8 [ 42.779106] RIP [] mincore_unmapped_range+0xdf/0x100 [ 42.779106] RSP [ 42.779106] CR2: ffff88007bb61000 [ 42.779106] ---[ end trace 3fac62521b6b0cb0 ]--- [ 42.779106] Kernel panic - not syncing: Fatal exception [ 42.779106] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff) Looks like 'vec' overflow. I don't see what could prevent do_mincore() to write more than PAGE_SIZE to 'vec'. -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755999AbaGIVgk (ORCPT ); Wed, 9 Jul 2014 17:36:40 -0400 Received: from mx1.redhat.com ([209.132.183.28]:9014 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751676AbaGIVgj (ORCPT ); Wed, 9 Jul 2014 17:36:39 -0400 Date: Wed, 9 Jul 2014 17:36:24 -0400 From: Naoya Horiguchi To: "Kirill A. 
Shutemov" Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140709213624.GC24698@nhori> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> <20140709133436.GA18391@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140709133436.GA18391@node.dhcp.inet.fi> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 09, 2014 at 04:34:36PM +0300, Kirill A. Shutemov wrote: > On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > > of code by using common page table walk code. > > > > ChangeLog v4: > > - remove redundant vma > > > > ChangeLog v3: > > - add NULL vma check in mincore_unmapped_range() > > - don't use pte_entry() > > > > ChangeLog v2: > > - change type of args of callbacks to void * > > - move definition of mincore_walk to the start of the function to fix compiler > > warning > > > > Signed-off-by: Naoya Horiguchi > > Trinity crases this implementation of mincore pretty easily: > > [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 > [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 Thanks for your testing/reporting. ... > > Looks like 'vec' overflow. I don't see what could prevent do_mincore() to > write more than PAGE_SIZE to 'vec'. I found the miscalculation of walk->private (vec) on thp and hugetlbfs. I confirmed that the reported problem is fixed (I checked that trinity never triggers the reported BUG) with the following changes on this patch. 
diff --git a/mm/mincore.c b/mm/mincore.c index 3c64dcbcb3e2..9eb10d867a6f 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -34,7 +34,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, present = pte && !huge_pte_none(huge_ptep_get(pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; - walk->private += (end - addr) >> PAGE_SHIFT; + walk->private = vec; #else BUG(); #endif @@ -118,8 +118,10 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd)) { + walk->private += (end - addr) >> PAGE_SHIFT; return 0; + } ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); for (; addr != end; ptep++, addr += PAGE_SIZE) { Thanks, Naoya Horiguchi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752991AbaGJKGW (ORCPT ); Thu, 10 Jul 2014 06:06:22 -0400 Received: from mta-out1.inet.fi ([62.71.2.198]:56389 "EHLO jenni2.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752693AbaGJKGU (ORCPT ); Thu, 10 Jul 2014 06:06:20 -0400 Date: Thu, 10 Jul 2014 13:06:00 +0300 From: "Kirill A. 
Shutemov" To: Naoya Horiguchi Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140710100600.GA30360@node.dhcp.inet.fi> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> <20140709133436.GA18391@node.dhcp.inet.fi> <20140709213624.GC24698@nhori> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140709213624.GC24698@nhori> User-Agent: Mutt/1.5.22.1 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 09, 2014 at 05:36:24PM -0400, Naoya Horiguchi wrote: > On Wed, Jul 09, 2014 at 04:34:36PM +0300, Kirill A. Shutemov wrote: > > On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > > > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > > > of code by using common page table walk code. > > > > > > ChangeLog v4: > > > - remove redundant vma > > > > > > ChangeLog v3: > > > - add NULL vma check in mincore_unmapped_range() > > > - don't use pte_entry() > > > > > > ChangeLog v2: > > > - change type of args of callbacks to void * > > > - move definition of mincore_walk to the start of the function to fix compiler > > > warning > > > > > > Signed-off-by: Naoya Horiguchi > > > > Trinity crases this implementation of mincore pretty easily: > > > > [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 > > [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 > > Thanks for your testing/reporting. > > ... > > > > Looks like 'vec' overflow. I don't see what could prevent do_mincore() to > > write more than PAGE_SIZE to 'vec'. > > I found the miscalculation of walk->private (vec) on thp and hugetlbfs. 
> I confirmed that the reported problem is fixed (I checked that trinity > never triggers the reported BUG) with the following changes on this patch. With the changes: [ 26.850945] BUG: unable to handle kernel paging request at ffff880852d8c000 [ 26.852718] IP: [] mincore_hugetlb+0x27/0x50 [ 26.853527] PGD 2ef6067 PUD 2ef9067 PMD 87fd4a067 PTE 8000000852d8c060 [ 26.854462] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [ 26.854752] Modules linked in: [ 26.854752] CPU: 5 PID: 170 Comm: trinity-c5 Not tainted 3.16.0-rc4-next-20140709-00013-g28e4629f71a8-dirty #1453 [ 26.854752] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 26.854752] task: ffff880852d22890 ti: ffff880852d24000 task.ti: ffff880852d24000 [ 26.854752] RIP: 0010:[] [] mincore_hugetlb+0x27/0x50 [ 26.854752] RSP: 0018:ffff880852d27e28 EFLAGS: 00010206 [ 26.854752] RAX: ffff880852d8c000 RBX: 00007f9fb2200000 RCX: 00007f9fb2200000 [ 26.854752] RDX: 00007f9fb2001000 RSI: ffffffffffe00000 RDI: ffff88084f3edc80 [ 26.854752] RBP: ffff880852d27e28 R08: ffff880852d27f10 R09: ffffffff81126dc0 [ 26.854752] R10: 0000000000000000 R11: 0000000000000001 R12: 00007f9fde000000 [ 26.854752] R13: ffffffff82e32580 R14: 00007f9fb2000000 R15: ffff880852d27f10 [ 26.854752] FS: 00007f9fe1bde700(0000) GS:ffff88085a000000(0000) knlGS:0000000000000000 [ 26.854752] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 26.854752] CR2: ffff880852d8c000 CR3: 0000000852d12000 CR4: 00000000000006e0 [ 26.854752] Stack: [ 26.854752] ffff880852d27eb8 ffffffff81135e24 ffff880852ce01d8 0000000000000282 [ 26.854752] ffff880852ce01d8 ffff880852d22890 00007f9fde000000 ffff880852d27eb0 [ 26.854752] 0000000000000282 0000000000000000 ffffffff81127399 0000000000000282 [ 26.854752] Call Trace: [ 26.854752] [] __walk_page_range+0x3f4/0x450 [ 26.854752] [] ? SyS_mincore+0x179/0x270 [ 26.854752] [] walk_page_vma+0x71/0x90 [ 26.854752] [] SyS_mincore+0x1de/0x270 [ 26.854752] [] ? mincore_unmapped_range+0x100/0x100 [ 26.854752] [] ? 
mincore_page+0xa0/0xa0 [ 26.854752] [] ? handle_mm_fault+0xd30/0xd30 [ 26.854752] [] system_call_fastpath+0x16/0x1b [ 26.854752] Code: 0f 1f 40 00 55 48 85 ff 49 8b 40 38 48 89 e5 74 33 48 83 3f 00 40 0f 95 c6 48 39 ca 74 19 66 0f 1f 44 00 00 48 81 c2 00 10 00 00 <40> 88 30 48 83 c0 01 48 39 d1 75 ed 49 89 40 38 31 c0 5d c3 0f [ 26.854752] RIP [] mincore_hugetlb+0x27/0x50 [ 26.854752] RSP [ 26.854752] CR2: ffff880852d8c000 [ 26.854752] ---[ end trace 536bbdef8c6d5b03 ]--- Could you explain to me how you protect 'vec' from overflowing? I don't see any code for that. -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752701AbaGJLck (ORCPT ); Thu, 10 Jul 2014 07:32:40 -0400 Received: from mta-out1.inet.fi ([62.71.2.198]:50499 "EHLO jenni1.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752216AbaGJLcj (ORCPT ); Thu, 10 Jul 2014 07:32:39 -0400 Date: Thu, 10 Jul 2014 14:32:19 +0300 From: "Kirill A. 
Shutemov" To: Naoya Horiguchi Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: Re: [PATCH v4 05/13] clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() Message-ID: <20140710113219.GA30954@node.dhcp.inet.fi> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> User-Agent: Mutt/1.5.22.1 (2013-10-16) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 01, 2014 at 01:07:23PM -0400, Naoya Horiguchi wrote: > @@ -822,38 +844,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > }; > struct mm_walk clear_refs_walk = { > .pmd_entry = clear_refs_pte_range, > + .test_walk = clear_refs_test_walk, > .mm = mm, > .private = &cp, > }; > down_read(&mm->mmap_sem); > if (type == CLEAR_REFS_SOFT_DIRTY) > mmu_notifier_invalidate_range_start(mm, 0, -1); > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > - cp.vma = vma; > - if (is_vm_hugetlb_page(vma)) > - continue; > - /* > - * Writing 1 to /proc/pid/clear_refs affects all pages. > - * > - * Writing 2 to /proc/pid/clear_refs only affects > - * Anonymous pages. > - * > - * Writing 3 to /proc/pid/clear_refs only affects file > - * mapped pages. > - * > - * Writing 4 to /proc/pid/clear_refs affects all pages. 
> - */ > - if (type == CLEAR_REFS_ANON && vma->vm_file) > - continue; > - if (type == CLEAR_REFS_MAPPED && !vma->vm_file) > - continue; > - if (type == CLEAR_REFS_SOFT_DIRTY) { > - if (vma->vm_flags & VM_SOFTDIRTY) > - vma->vm_flags &= ~VM_SOFTDIRTY; > - } > - walk_page_range(vma->vm_start, vma->vm_end, > - &clear_refs_walk); > - } > + walk_page_range(0, ~0UL, &clear_refs_walk); 'vma' variable is now unused in the clear_refs_write(). -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753972AbaGJN2R (ORCPT ); Thu, 10 Jul 2014 09:28:17 -0400 Received: from mx1.redhat.com ([209.132.183.28]:43560 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752721AbaGJN2P (ORCPT ); Thu, 10 Jul 2014 09:28:15 -0400 Date: Thu, 10 Jul 2014 09:27:17 -0400 From: Naoya Horiguchi To: "Kirill A. Shutemov" Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: Re: [PATCH v4 05/13] clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() Message-ID: <20140710132717.GA12391@nhori> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-6-git-send-email-n-horiguchi@ah.jp.nec.com> <20140710113219.GA30954@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140710113219.GA30954@node.dhcp.inet.fi> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 10, 2014 at 02:32:19PM +0300, Kirill A. 
Shutemov wrote: > On Tue, Jul 01, 2014 at 01:07:23PM -0400, Naoya Horiguchi wrote: > > @@ -822,38 +844,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > > }; > > struct mm_walk clear_refs_walk = { > > .pmd_entry = clear_refs_pte_range, > > + .test_walk = clear_refs_test_walk, > > .mm = mm, > > .private = &cp, > > }; > > down_read(&mm->mmap_sem); > > if (type == CLEAR_REFS_SOFT_DIRTY) > > mmu_notifier_invalidate_range_start(mm, 0, -1); > > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > > - cp.vma = vma; > > - if (is_vm_hugetlb_page(vma)) > > - continue; > > - /* > > - * Writing 1 to /proc/pid/clear_refs affects all pages. > > - * > > - * Writing 2 to /proc/pid/clear_refs only affects > > - * Anonymous pages. > > - * > > - * Writing 3 to /proc/pid/clear_refs only affects file > > - * mapped pages. > > - * > > - * Writing 4 to /proc/pid/clear_refs affects all pages. > > - */ > > - if (type == CLEAR_REFS_ANON && vma->vm_file) > > - continue; > > - if (type == CLEAR_REFS_MAPPED && !vma->vm_file) > > - continue; > > - if (type == CLEAR_REFS_SOFT_DIRTY) { > > - if (vma->vm_flags & VM_SOFTDIRTY) > > - vma->vm_flags &= ~VM_SOFTDIRTY; > > - } > > - walk_page_range(vma->vm_start, vma->vm_end, > > - &clear_refs_walk); > > - } > > + walk_page_range(0, ~0UL, &clear_refs_walk); > > 'vma' variable is now unused in the clear_refs_write(). Yes, will remove it. Naoya From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752310AbaGJQgS (ORCPT ); Thu, 10 Jul 2014 12:36:18 -0400 Received: from mx1.redhat.com ([209.132.183.28]:23733 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752211AbaGJQgP (ORCPT ); Thu, 10 Jul 2014 12:36:15 -0400 Date: Thu, 10 Jul 2014 12:35:55 -0400 From: Naoya Horiguchi To: "Kirill A. 
Shutemov" Cc: linux-mm@kvack.org, Andrew Morton , Dave Hansen , Hugh Dickins , Jerome Marchand , linux-kernel@vger.kernel.org, Naoya Horiguchi Subject: Re: [PATCH v4 13/13] mincore: apply page table walker on do_mincore() Message-ID: <20140710163555.GB12391@nhori> References: <1404234451-21695-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1404234451-21695-14-git-send-email-n-horiguchi@ah.jp.nec.com> <20140709133436.GA18391@node.dhcp.inet.fi> <20140709213624.GC24698@nhori> <20140710100600.GA30360@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140710100600.GA30360@node.dhcp.inet.fi> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 10, 2014 at 01:06:00PM +0300, Kirill A. Shutemov wrote: > On Wed, Jul 09, 2014 at 05:36:24PM -0400, Naoya Horiguchi wrote: > > On Wed, Jul 09, 2014 at 04:34:36PM +0300, Kirill A. Shutemov wrote: > > > On Tue, Jul 01, 2014 at 01:07:31PM -0400, Naoya Horiguchi wrote: > > > > This patch makes do_mincore() use walk_page_vma(), which reduces many lines > > > > of code by using common page table walk code. > > > > > > > > ChangeLog v4: > > > > - remove redundant vma > > > > > > > > ChangeLog v3: > > > > - add NULL vma check in mincore_unmapped_range() > > > > - don't use pte_entry() > > > > > > > > ChangeLog v2: > > > > - change type of args of callbacks to void * > > > > - move definition of mincore_walk to the start of the function to fix compiler > > > > warning > > > > > > > > Signed-off-by: Naoya Horiguchi > > > > > > Trinity crases this implementation of mincore pretty easily: > > > > > > [ 42.775369] BUG: unable to handle kernel paging request at ffff88007bb61000 > > > [ 42.776656] IP: [] mincore_unmapped_range+0xdf/0x100 > > > > Thanks for your testing/reporting. > > > > ... > > > > > > Looks like 'vec' overflow. 
I don't see what could prevent do_mincore() to > > > write more than PAGE_SIZE to 'vec'. > > > > I found the miscalculation of walk->private (vec) on thp and hugetlbfs. > > I confirmed that the reported problem is fixed (I checked that trinity > > never triggers the reported BUG) with the following changes on this patch. > > With the changes: > > [ 26.850945] BUG: unable to handle kernel paging request at ffff880852d8c000 > [ 26.852718] IP: [] mincore_hugetlb+0x27/0x50 > [ 26.853527] PGD 2ef6067 PUD 2ef9067 PMD 87fd4a067 PTE 8000000852d8c060 > [ 26.854462] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC > [ 26.854752] Modules linked in: > [ 26.854752] CPU: 5 PID: 170 Comm: trinity-c5 Not tainted 3.16.0-rc4-next-20140709-00013-g28e4629f71a8-dirty #1453 > [ 26.854752] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 26.854752] task: ffff880852d22890 ti: ffff880852d24000 task.ti: ffff880852d24000 > [ 26.854752] RIP: 0010:[] [] mincore_hugetlb+0x27/0x50 > [ 26.854752] RSP: 0018:ffff880852d27e28 EFLAGS: 00010206 > [ 26.854752] RAX: ffff880852d8c000 RBX: 00007f9fb2200000 RCX: 00007f9fb2200000 > [ 26.854752] RDX: 00007f9fb2001000 RSI: ffffffffffe00000 RDI: ffff88084f3edc80 > [ 26.854752] RBP: ffff880852d27e28 R08: ffff880852d27f10 R09: ffffffff81126dc0 > [ 26.854752] R10: 0000000000000000 R11: 0000000000000001 R12: 00007f9fde000000 > [ 26.854752] R13: ffffffff82e32580 R14: 00007f9fb2000000 R15: ffff880852d27f10 > [ 26.854752] FS: 00007f9fe1bde700(0000) GS:ffff88085a000000(0000) knlGS:0000000000000000 > [ 26.854752] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 26.854752] CR2: ffff880852d8c000 CR3: 0000000852d12000 CR4: 00000000000006e0 > [ 26.854752] Stack: > [ 26.854752] ffff880852d27eb8 ffffffff81135e24 ffff880852ce01d8 0000000000000282 > [ 26.854752] ffff880852ce01d8 ffff880852d22890 00007f9fde000000 ffff880852d27eb0 > [ 26.854752] 0000000000000282 0000000000000000 ffffffff81127399 0000000000000282 > [ 26.854752] Call Trace: > [ 26.854752] [] 
__walk_page_range+0x3f4/0x450 > [ 26.854752] [] ? SyS_mincore+0x179/0x270 > [ 26.854752] [] walk_page_vma+0x71/0x90 > [ 26.854752] [] SyS_mincore+0x1de/0x270 > [ 26.854752] [] ? mincore_unmapped_range+0x100/0x100 > [ 26.854752] [] ? mincore_page+0xa0/0xa0 > [ 26.854752] [] ? handle_mm_fault+0xd30/0xd30 > [ 26.854752] [] system_call_fastpath+0x16/0x1b > [ 26.854752] Code: 0f 1f 40 00 55 48 85 ff 49 8b 40 38 48 89 e5 74 33 48 83 3f 00 40 0f 95 c6 48 39 ca 74 19 66 0f 1f 44 00 00 48 81 c2 00 10 00 00 <40> 88 30 48 83 c0 01 48 39 d1 75 ed 49 89 40 38 31 c0 5d c3 0f > [ 26.854752] RIP [] mincore_hugetlb+0x27/0x50 > [ 26.854752] RSP > [ 26.854752] CR2: ffff880852d8c000 > [ 26.854752] ---[ end trace 536bbdef8c6d5b03 ]--- > > Could you explain to me how you protect 'vec' from being overflowed? I don't > any code for that. I don't do it explicitly, so adding such a check is one solution. But I think the problem comes from using walk_page_vma() instead of walk_page_range(): walk_page_vma() forcibly sets the walk range to [vma->vm_start, vma->vm_end). Limiting the range to [addr, addr + (pages << PAGE_SHIFT)), as the original code does, is fine because it implicitly prevents the buffer overflow. Here is the revised fix for this patch. Please discard the one I sent yesterday because it was wrong.
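As a rough illustration of why clamping the range bounds the writes to 'vec', the sketch below compares how many one-byte entries a walk produces when it covers the whole vma versus only the requested range. It is userspace arithmetic only, not kernel code: 4 KiB pages are assumed, and SKETCH_PAGE_SHIFT, min_ul(), clamped_vec_entries() and whole_vma_entries() are hypothetical names.

```c
/* Assumed page size of 4 KiB. */
#define SKETCH_PAGE_SHIFT 12

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Entries written when the walk is limited to
 * [addr, min(vm_end, addr + (pages << PAGE_SHIFT))).
 * The result can never exceed 'pages'.
 */
static unsigned long clamped_vec_entries(unsigned long addr,
					 unsigned long pages,
					 unsigned long vm_end)
{
	unsigned long end = min_ul(vm_end, addr + (pages << SKETCH_PAGE_SHIFT));

	return (end - addr) >> SKETCH_PAGE_SHIFT;
}

/* Entries written when the walk always covers [vm_start, vm_end). */
static unsigned long whole_vma_entries(unsigned long vm_start,
				       unsigned long vm_end)
{
	return (vm_end - vm_start) >> SKETCH_PAGE_SHIFT;
}
```

With a 32 MiB vma at [0x10000000, 0x12000000) and a request for 4096 pages starting at vm_start, the clamped walk yields 4096 entries while a whole-vma walk yields 8192, which would run past a buffer sized for the request.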
Thanks, Naoya Horiguchi --- diff --git a/mm/mincore.c b/mm/mincore.c index 3c64dcbcb3e2..0e548fbce19e 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -34,7 +34,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, present = pte && !huge_pte_none(huge_ptep_get(pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; - walk->private += (end - addr) >> PAGE_SHIFT; + walk->private = vec; #else BUG(); #endif @@ -118,8 +118,10 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } - if (pmd_trans_unstable(pmd)) + if (pmd_trans_unstable(pmd)) { + mincore_unmapped_range(addr, end, walk); return 0; + } ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); for (; addr != end; ptep++, addr += PAGE_SIZE) { @@ -168,6 +170,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec) { struct vm_area_struct *vma; + unsigned long end; int err; struct mm_walk mincore_walk = { .pmd_entry = mincore_pte_range, @@ -180,16 +183,11 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v if (!vma || addr < vma->vm_start) return -ENOMEM; mincore_walk.mm = vma->vm_mm; - - err = walk_page_vma(vma, &mincore_walk); + end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; - else { - unsigned long end; - - end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - return (end - addr) >> PAGE_SHIFT; - } + return (end - addr) >> PAGE_SHIFT; } /*