From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f172.google.com (mail-we0-f172.google.com [74.125.82.172]) by kanga.kvack.org (Postfix) with ESMTP id E60BE6B0039 for ; Thu, 12 Jun 2014 17:48:31 -0400 (EDT) Received: by mail-we0-f172.google.com with SMTP id u57so1950739wes.31 for ; Thu, 12 Jun 2014 14:48:31 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id qi1si3747559wjc.18.2014.06.12.14.48.28 for ; Thu, 12 Jun 2014 14:48:29 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 00/11] pagewalk: standardize current users, move pmd locking, apply to mincore Date: Thu, 12 Jun 2014 17:48:00 -0400 Message-Id: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org This is ver.2 of page table walker patchset. I move forward on this cleanup work, and added some improvement from the previous version. Major changes are: - removed walk->skip which becomes removable due to refactoring existing users - commonalized the argments of entry handlers (pte|pmd|hugetlb)_entry() which allows us to use the same function as multiple handlers. This patchset is based on mmotm-2014-05-21-16-57. Tree: git@github.com:Naoya-Horiguchi/linux.git Branch: mmotm-2014-05-21-16-57/page_table_walker.v2 Thanks, Naoya Horiguchi --- Summary: Naoya Horiguchi (11): pagewalk: remove pgd_entry() and pud_entry() madvise: cleanup swapin_walk_pmd_entry() memcg: separate mem_cgroup_move_charge_pte_range() pagewalk: move pmd_trans_huge_lock() from callbacks to common code pagewalk: remove mm_walk->skip pagewalk: add size to struct mm_walk pagewalk: change type of arg of callbacks pagewalk: update comment on walk_page_range() fs/proc/task_mmu.c: refactor smaps fs/proc/task_mmu.c: clean up gather_*_stats() mincore: apply page table walker on do_mincore() arch/openrisc/kernel/dma.c | 6 +- arch/powerpc/mm/subpage-prot.c | 5 +- fs/proc/task_mmu.c | 140 ++++++++--------------------- include/linux/mm.h | 21 ++--- mm/huge_memory.c | 20 ----- mm/madvise.c | 55 +++++------- mm/memcontrol.c | 170 +++++++++++++++++------------------ mm/memory.c | 5 +- mm/mempolicy.c | 15 ++-- mm/mincore.c | 195 ++++++++++++++--------------------------- mm/pagewalk.c | 143 +++++++++++++----------------- 11 files changed, 294 insertions(+), 481 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f174.google.com (mail-we0-f174.google.com [74.125.82.174]) by kanga.kvack.org (Postfix) with ESMTP id B31D66B003A for ; Thu, 12 Jun 2014 17:48:33 -0400 (EDT) Received: by mail-we0-f174.google.com with SMTP id u57so1895671wes.19 for ; Thu, 12 Jun 2014 14:48:33 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id bu8si3738236wjc.35.2014.06.12.14.48.30 for ; Thu, 12 Jun 2014 14:48:31 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 01/11] pagewalk: remove pgd_entry() and pud_entry() Date: Thu, 12 Jun 2014 17:48:01 -0400 Message-Id: <1402609691-13950-2-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Currently no user of page table walker sets ->pgd_entry() or ->pud_entry(), so checking their existence in each loop is just wasting CPU cycle. So let's remove it to reduce overhead. Signed-off-by: Naoya Horiguchi --- include/linux/mm.h | 6 ------ mm/pagewalk.c | 18 +----------------- 2 files changed, 1 insertion(+), 23 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index 563c79ea07bd..b4aa6579f2b1 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1092,8 +1092,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, /** * mm_walk - callbacks for walk_page_range - * @pgd_entry: if set, called for each non-empty PGD (top-level) entry - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to @@ -1115,10 +1113,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * (see the comment on walk_page_range() for more details) */ struct mm_walk { - int (*pgd_entry)(pgd_t *pgd, unsigned long addr, - unsigned long next, struct mm_walk *walk); - int (*pud_entry)(pud_t *pud, unsigned long addr, - unsigned long next, struct mm_walk *walk); int (*pmd_entry)(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_entry)(pte_t *pte, unsigned long addr, diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index 2eda3dbe0b52..e734f63276c2 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -115,14 +115,6 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, continue; } - if (walk->pud_entry) { - err = walk->pud_entry(pud, addr, next, walk); - if (skip_lower_level_walking(walk)) - continue; - if (err) - break; - } - if (walk->pmd_entry || walk->pte_entry) { err = walk_pmd_range(pud, addr, next, walk); if (err) @@ -152,15 +144,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, continue; } - if (walk->pgd_entry) { - err = walk->pgd_entry(pgd, addr, next, walk); - if (skip_lower_level_walking(walk)) - continue; - if (err) - break; - } - - if (walk->pud_entry || walk->pmd_entry || walk->pte_entry) { + if (walk->pmd_entry || walk->pte_entry) { err = walk_pud_range(pgd, addr, next, walk); if (err) break; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f180.google.com (mail-we0-f180.google.com [74.125.82.180]) by kanga.kvack.org (Postfix) with ESMTP id 3C0496B005A for ; Thu, 12 Jun 2014 17:48:34 -0400 (EDT) Received: by mail-we0-f180.google.com with SMTP id x48so1899267wes.25 for ; Thu, 12 Jun 2014 14:48:33 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id eu5si12618559wic.47.2014.06.12.14.48.31 for ; Thu, 12 Jun 2014 14:48:32 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 02/11] madvise: cleanup swapin_walk_pmd_entry() Date: Thu, 12 Jun 2014 17:48:02 -0400 Message-Id: <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org With the recent update on page table walker, we can use common code for the walking more. Unlike many other users, this swapin_walk expects to handle swap entries. As a result we should be careful about ptl locking. Swapin operation, read_swap_cache_async(), could cause page reclaim, so we can't keep holding ptl throughout this pte loop. In order to properly handle ptl in pte_entry(), this patch adds two new members on struct mm_walk. This cleanup is necessary to get to the final form of page table walker, where we should do all caller's specific work on leaf entries (IOW, all pmd_entry() should be used for trans_pmd.) Signed-off-by: Naoya Horiguchi Cc: Hugh Dickins --- include/linux/mm.h | 4 ++++ mm/madvise.c | 54 +++++++++++++++++++++++------------------------------- mm/pagewalk.c | 11 +++++------ 3 files changed, 32 insertions(+), 37 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index b4aa6579f2b1..aa832161a1ff 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1108,6 +1108,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @vma: vma currently walked * @skip: internal control flag which is set when we skip the lower * level entries. + * @pmd: current pmd entry + * @ptl: page table lock associated with current entry * @private: private data for callbacks' use * * (see the comment on walk_page_range() for more details) @@ -1126,6 +1128,8 @@ struct mm_walk { struct mm_struct *mm; struct vm_area_struct *vma; int skip; + pmd_t *pmd; + spinlock_t *ptl; void *private; }; diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c index a402f8fdc68e..06b390a6fbbd 100644 --- mmotm-2014-05-21-16-57.orig/mm/madvise.c +++ mmotm-2014-05-21-16-57/mm/madvise.c @@ -135,38 +135,31 @@ static long madvise_behavior(struct vm_area_struct *vma, } #ifdef CONFIG_SWAP -static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, +/* + * Assuming that page table walker holds page table lock. + */ +static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, unsigned long end, struct mm_walk *walk) { - pte_t *orig_pte; - struct vm_area_struct *vma = walk->private; - unsigned long index; - - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - return 0; - - for (index = start; index != end; index += PAGE_SIZE) { - pte_t pte; - swp_entry_t entry; - struct page *page; - spinlock_t *ptl; - - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); - pte_unmap_unlock(orig_pte, ptl); - - if (pte_present(pte) || pte_none(pte) || pte_file(pte)) - continue; - entry = pte_to_swp_entry(pte); - if (unlikely(non_swap_entry(entry))) - continue; - - page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, - vma, index); - if (page) - page_cache_release(page); - } + pte_t ptent; + pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); + swp_entry_t entry; + struct page *page; + ptent = *pte; + pte_unmap_unlock(orig_pte, walk->ptl); + if (pte_present(ptent) || pte_none(ptent) || pte_file(ptent)) + goto lock; + entry = pte_to_swp_entry(ptent); + if (unlikely(non_swap_entry(entry))) + goto lock; + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, + walk->vma, start); + if (page) + page_cache_release(page); +lock: + pte_offset_map(walk->pmd, start & PMD_MASK); + spin_lock(walk->ptl); return 0; } @@ -175,8 +168,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma, { struct mm_walk walk = { .mm = vma->vm_mm, - .pmd_entry = swapin_walk_pmd_entry, - .private = vma, + .pte_entry = swapin_walk_pte_entry, }; walk_page_range(start, end, &walk); diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index e734f63276c2..24311d6f5c20 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -27,10 +27,10 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, struct mm_struct *mm = walk->mm; pte_t *pte; pte_t *orig_pte; - spinlock_t *ptl; int err = 0; - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + walk->pmd = pmd; + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); do { if (pte_none(*pte)) { if (walk->pte_hole) @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, if (err) break; } while (pte++, addr += PAGE_SIZE, addr < end); - pte_unmap_unlock(orig_pte, ptl); + pte_unmap_unlock(orig_pte, walk->ptl); cond_resched(); return addr == end ? 0 : err; } @@ -172,7 +172,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, unsigned long hmask = huge_page_mask(h); pte_t *pte; int err = 0; - spinlock_t *ptl; do { next = hugetlb_entry_end(h, addr, end); @@ -186,14 +185,14 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, break; continue; } - ptl = huge_pte_lock(h, mm, pte); + walk->ptl = huge_pte_lock(h, mm, pte); /* * Callers should have their own way to handle swap entries * in walk->hugetlb_entry(). */ if (walk->hugetlb_entry) err = walk->hugetlb_entry(pte, addr, next, walk); - spin_unlock(ptl); + spin_unlock(walk->ptl); if (err) break; } while (addr = next, addr != end); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id 78AA76B003C for ; Thu, 12 Jun 2014 17:48:34 -0400 (EDT) Received: by mail-wi0-f171.google.com with SMTP id n15so7744236wiw.16 for ; Thu, 12 Jun 2014 14:48:34 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id by5si3694979wjc.114.2014.06.12.14.48.31 for ; Thu, 12 Jun 2014 14:48:32 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 03/11] memcg: separate mem_cgroup_move_charge_pte_range() Date: Thu, 12 Jun 2014 17:48:03 -0400 Message-Id: <1402609691-13950-4-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org mem_cgroup_move_charge_pte_range() handles both pte and pmd, which is not standardized, so let's cleanup it. One tricky part is the retry, which is performed when we detect !mc.precharge. In such case we retry the same entry, so we don't have to go outside the pte loop. With rewriting this retry in the pte loop, we can separate pmd_entry() and pte_entry(), which is what we need. Signed-off-by: Naoya Horiguchi --- mm/memcontrol.c | 127 +++++++++++++++++++++++++++++--------------------------- 1 file changed, 66 insertions(+), 61 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c index 6970857ba0c8..01a66a208769 100644 --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c +++ mmotm-2014-05-21-16-57/mm/memcontrol.c @@ -6881,14 +6881,72 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css, mem_cgroup_clear_mc(); } -static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, +static int mem_cgroup_move_charge_pte(pte_t *pte, unsigned long addr, unsigned long end, struct mm_walk *walk) { int ret = 0; struct vm_area_struct *vma = walk->vma; - pte_t *pte; - spinlock_t *ptl; + union mc_target target; + struct page *page; + struct page_cgroup *pc; + swp_entry_t ent; + +retry: + if (!mc.precharge) { + pte_t *orig_pte = pte - ((addr & (PMD_SIZE - 1)) >> PAGE_SHIFT); + pte_unmap_unlock(orig_pte, walk->ptl); + cond_resched(); + /* + * We have consumed all precharges we got in can_attach(). + * We try charge one by one, but don't do any additional + * charges to mc.to if we have failed in charge once in attach() + * phase. + */ + ret = mem_cgroup_do_precharge(1); + pte_offset_map(walk->pmd, addr & PMD_MASK); + spin_lock(walk->ptl); + if (!ret) + goto retry; + return ret; + } + + switch (get_mctgt_type(vma, addr, *pte, &target)) { + case MC_TARGET_PAGE: + page = target.page; + if (isolate_lru_page(page)) + goto put; + pc = lookup_page_cgroup(page); + if (!mem_cgroup_move_account(page, 1, pc, + mc.from, mc.to)) { + mc.precharge--; + /* we uncharge from mc.from later. */ + mc.moved_charge++; + } + putback_lru_page(page); +put: /* get_mctgt_type() gets the page */ + put_page(page); + break; + case MC_TARGET_SWAP: + ent = target.ent; + if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) { + mc.precharge--; + /* we fixup refcnts and charges later. */ + mc.moved_swap++; + } + break; + default: + break; + } + + return 0; +} + +static int mem_cgroup_move_charge_pmd(pmd_t *pmd, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; enum mc_target_type target_type; union mc_target target; struct page *page; @@ -6924,71 +6982,18 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, put_page(page); } spin_unlock(ptl); - return 0; - } - - if (pmd_trans_unstable(pmd)) - return 0; -retry: - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - for (; addr != end; addr += PAGE_SIZE) { - pte_t ptent = *(pte++); - swp_entry_t ent; - - if (!mc.precharge) - break; - - switch (get_mctgt_type(vma, addr, ptent, &target)) { - case MC_TARGET_PAGE: - page = target.page; - if (isolate_lru_page(page)) - goto put; - pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, 1, pc, - mc.from, mc.to)) { - mc.precharge--; - /* we uncharge from mc.from later. */ - mc.moved_charge++; - } - putback_lru_page(page); -put: /* get_mctgt_type() gets the page */ - put_page(page); - break; - case MC_TARGET_SWAP: - ent = target.ent; - if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) { - mc.precharge--; - /* we fixup refcnts and charges later. */ - mc.moved_swap++; - } - break; - default: - break; - } - } - pte_unmap_unlock(pte - 1, ptl); - cond_resched(); - - if (addr != end) { - /* - * We have consumed all precharges we got in can_attach(). - * We try charge one by one, but don't do any additional - * charges to mc.to if we have failed in charge once in attach() - * phase. - */ - ret = mem_cgroup_do_precharge(1); - if (!ret) - goto retry; + /* don't call mem_cgroup_move_charge_pte() */ + walk->skip = 1; } - - return ret; + return 0; } static void mem_cgroup_move_charge(struct mm_struct *mm) { struct vm_area_struct *vma; struct mm_walk mem_cgroup_move_charge_walk = { - .pmd_entry = mem_cgroup_move_charge_pte_range, + .pmd_entry = mem_cgroup_move_charge_pmd, + .pte_entry = mem_cgroup_move_charge_pte, .mm = mm, }; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f173.google.com (mail-qc0-f173.google.com [209.85.216.173]) by kanga.kvack.org (Postfix) with ESMTP id 4D2E06B005C for ; Thu, 12 Jun 2014 17:48:36 -0400 (EDT) Received: by mail-qc0-f173.google.com with SMTP id l6so2985595qcy.32 for ; Thu, 12 Jun 2014 14:48:36 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id e10si2587111qai.96.2014.06.12.14.48.35 for ; Thu, 12 Jun 2014 14:48:35 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 05/11] pagewalk: remove mm_walk->skip Date: Thu, 12 Jun 2014 17:48:05 -0400 Message-Id: <1402609691-13950-6-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Due to the relocation of pmd locking, mm_walk->skip becomes less important because only walk_page_test() and walk->test_walk() use it. None of these functions uses a positive value as a return value, so we can define it to determine whether we skip the current vma or not. Thus this patch removes mm_walk->skip. Signed-off-by: Naoya Horiguchi --- fs/proc/task_mmu.c | 4 ++-- include/linux/mm.h | 3 --- mm/mempolicy.c | 9 ++++----- mm/pagewalk.c | 36 ++++++++---------------------------- 4 files changed, 14 insertions(+), 38 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index 059206ea3c6b..8211f6c8236d 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -755,9 +755,9 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end, * Writing 4 to /proc/pid/clear_refs affects all pages. */ if (cp->type == CLEAR_REFS_ANON && vma->vm_file) - walk->skip = 1; + return 1; if (cp->type == CLEAR_REFS_MAPPED && !vma->vm_file) - walk->skip = 1; + return 1; if (cp->type == CLEAR_REFS_SOFT_DIRTY) { if (vma->vm_flags & VM_SOFTDIRTY) vma->vm_flags &= ~VM_SOFTDIRTY; diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index aa832161a1ff..0a20674c84e2 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1106,8 +1106,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * right now." 0 means "skip the current vma." * @mm: mm_struct representing the target process of page table walk * @vma: vma currently walked - * @skip: internal control flag which is set when we skip the lower - * level entries. * @pmd: current pmd entry * @ptl: page table lock associated with current entry * @private: private data for callbacks' use @@ -1127,7 +1125,6 @@ struct mm_walk { struct mm_walk *walk); struct mm_struct *mm; struct vm_area_struct *vma; - int skip; pmd_t *pmd; spinlock_t *ptl; void *private; diff --git mmotm-2014-05-21-16-57.orig/mm/mempolicy.c mmotm-2014-05-21-16-57/mm/mempolicy.c index cf3b995b21d0..b8267f753748 100644 --- mmotm-2014-05-21-16-57.orig/mm/mempolicy.c +++ mmotm-2014-05-21-16-57/mm/mempolicy.c @@ -596,22 +596,21 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, } qp->prev = vma; - walk->skip = 1; if (vma->vm_flags & VM_PFNMAP) - return 0; + return 1; if (flags & MPOL_MF_LAZY) { change_prot_numa(vma, start, endvma); - return 0; + return 1; } if ((flags & MPOL_MF_STRICT) || ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && vma_migratable(vma))) /* queue pages from current vma */ - walk->skip = 0; - return 0; + return 0; + return 1; } /* diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index f1a3417d0b51..61d6bd9545d6 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -3,24 +3,6 @@ #include #include -/* - * Check the current skip status of page table walker. - * - * Here what I mean by skip is to skip lower level walking, and that was - * determined for each entry independently. For example, when walk_pmd_range - * handles a pmd_trans_huge we don't have to walk over ptes under that pmd, - * and the skipping does not affect the walking over ptes under other pmds. - * That's why we reset @walk->skip after tested. - */ -static bool skip_lower_level_walking(struct mm_walk *walk) -{ - if (walk->skip) { - walk->skip = 0; - return true; - } - return false; -} - static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -89,8 +71,6 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, err = walk->pmd_entry(pmd, addr, next, walk); spin_unlock(walk->ptl); } - if (skip_lower_level_walking(walk)) - continue; if (err) break; } @@ -225,9 +205,9 @@ static inline int walk_hugetlb_range(unsigned long addr, unsigned long end, /* * Decide whether we really walk over the current vma on [@start, @end) - * or skip it. When we skip it, we set @walk->skip to 1. - * The return value is used to control the page table walking to - * continue (for zero) or not (for non-zero). + * or skip it via the returned value. Return 0 if we do walk over the + * current vma, and return 1 if we skip the vma. Negative values means + * error, where we abort the current walk. * * Default check (only VM_PFNMAP check for now) is used when the caller * doesn't define test_walk() callback. @@ -245,7 +225,7 @@ static int walk_page_test(unsigned long start, unsigned long end, * page backing a VM_PFNMAP range. See also commit a9ff785e4437. */ if (vma->vm_flags & VM_PFNMAP) - walk->skip = 1; + return 1; return 0; } @@ -330,9 +310,9 @@ int walk_page_range(unsigned long start, unsigned long end, next = min(end, vma->vm_end); err = walk_page_test(start, next, walk); - if (skip_lower_level_walking(walk)) + if (err == 1) continue; - if (err) + if (err < 0) break; } err = __walk_page_range(start, next, walk); @@ -353,9 +333,9 @@ int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk) VM_BUG_ON(!vma); walk->vma = vma; err = walk_page_test(vma->vm_start, vma->vm_end, walk); - if (skip_lower_level_walking(walk)) + if (err == 1) return 0; - if (err) + if (err < 0) return err; return __walk_page_range(vma->vm_start, vma->vm_end, walk); } -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f42.google.com (mail-wg0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id 52DFF6B0062 for ; Thu, 12 Jun 2014 17:48:36 -0400 (EDT) Received: by mail-wg0-f42.google.com with SMTP id z12so1864422wgg.25 for ; Thu, 12 Jun 2014 14:48:35 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id sh2si28987530wic.40.2014.06.12.14.48.34 for ; Thu, 12 Jun 2014 14:48:35 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code Date: Thu, 12 Jun 2014 17:48:04 -0400 Message-Id: <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Now all of current users of page table walker are canonicalized, i.e. pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. So we can factorize common code more. This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. ChangeLog v2: - add null check walk->vma in walk_pmd_range() - move comment update into a separate patch Signed-off-by: Naoya Horiguchi --- arch/powerpc/mm/subpage-prot.c | 2 ++ fs/proc/task_mmu.c | 66 ++++++++++++++---------------------------- mm/memcontrol.c | 55 ++++++++++------------------------- mm/pagewalk.c | 18 ++++++++++-- 4 files changed, 55 insertions(+), 86 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c index fa9fb5b4c66c..d0d94ac606f3 100644 --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c @@ -135,7 +135,9 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct vm_area_struct *vma = walk->vma; + spin_unlock(walk->ptl); split_huge_page_pmd(vma, addr, pmd); + spin_lock(walk->ptl); return 0; } diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index fa6d6a4e85b3..059206ea3c6b 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -496,15 +496,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - spinlock_t *ptl; - - if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) { - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); - spin_unlock(ptl); - mss->anonymous_thp += HPAGE_PMD_SIZE; - /* don't call smaps_pte() */ - walk->skip = 1; - } + smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); + mss->anonymous_thp += HPAGE_PMD_SIZE; return 0; } @@ -983,31 +976,21 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, struct vm_area_struct *vma = walk->vma; struct pagemapread *pm = walk->private; pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); - spinlock_t *ptl; - - if (!vma) - return err; - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - int pmd_flags2; + int pmd_flags2; - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) - pmd_flags2 = __PM_SOFT_DIRTY; - else - pmd_flags2 = 0; + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) + pmd_flags2 = __PM_SOFT_DIRTY; + else + pmd_flags2 = 0; - for (; addr != end; addr += PAGE_SIZE) { - unsigned long offset; + for (; addr != end; addr += PAGE_SIZE) { + unsigned long offset; - offset = (addr & ~PAGEMAP_WALK_MASK) >> - PAGE_SHIFT; - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); - err = add_to_pagemap(addr, &pme, pm); - if (err) - break; - } - spin_unlock(ptl); - /* don't call pagemap_pte() */ - walk->skip = 1; + offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); + err = add_to_pagemap(addr, &pme, pm); + if (err) + break; } return err; } @@ -1277,20 +1260,13 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, { struct numa_maps *md = walk->private; struct vm_area_struct *vma = walk->vma; - spinlock_t *ptl; - - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - pte_t huge_pte = *(pte_t *)pmd; - struct page *page; - - page = can_gather_numa_stats(huge_pte, vma, addr); - if (page) - gather_stats(page, md, pte_dirty(huge_pte), - HPAGE_PMD_SIZE/PAGE_SIZE); - spin_unlock(ptl); - /* don't call gather_pte_stats() */ - walk->skip = 1; - } + pte_t huge_pte = *(pte_t *)pmd; + struct page *page; + + page = can_gather_numa_stats(huge_pte, vma, addr); + if (page) + gather_stats(page, md, pte_dirty(huge_pte), + HPAGE_PMD_SIZE/PAGE_SIZE); return 0; } #ifdef CONFIG_HUGETLB_PAGE diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c index 01a66a208769..bb987cb9e043 100644 --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c +++ mmotm-2014-05-21-16-57/mm/memcontrol.c @@ -6723,15 +6723,9 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, struct mm_walk *walk) { struct vm_area_struct *vma = walk->vma; - spinlock_t *ptl; - - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) - mc.precharge += HPAGE_PMD_NR; - spin_unlock(ptl); - /* don't call mem_cgroup_count_precharge_pte() */ - walk->skip = 1; - } + + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) + mc.precharge += HPAGE_PMD_NR; return 0; } @@ -6952,38 +6946,21 @@ static int mem_cgroup_move_charge_pmd(pmd_t *pmd, struct page *page; struct page_cgroup *pc; - /* - * We don't take compound_lock() here but no race with splitting thp - * happens because: - * - if pmd_trans_huge_lock() returns 1, the relevant thp is not - * under splitting, which means there's no concurrent thp split, - * - if another thread runs into split_huge_page() just after we - * entered this if-block, the thread must wait for page table lock - * to be unlocked in __split_huge_page_splitting(), where the main - * part of thp split is not executed yet. - */ - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - if (mc.precharge < HPAGE_PMD_NR) { - spin_unlock(ptl); - return 0; - } - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); - if (target_type == MC_TARGET_PAGE) { - page = target.page; - if (!isolate_lru_page(page)) { - pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, - pc, mc.from, mc.to)) { - mc.precharge -= HPAGE_PMD_NR; - mc.moved_charge += HPAGE_PMD_NR; - } - putback_lru_page(page); + if (mc.precharge < HPAGE_PMD_NR) + return 0; + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); + if (target_type == MC_TARGET_PAGE) { + page = target.page; + if (!isolate_lru_page(page)) { + pc = lookup_page_cgroup(page); + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, + pc, mc.from, mc.to)) { + mc.precharge -= HPAGE_PMD_NR; + mc.moved_charge += HPAGE_PMD_NR; } - put_page(page); + putback_lru_page(page); } - spin_unlock(ptl); - /* don't call mem_cgroup_move_charge_pte() */ - walk->skip = 1; + put_page(page); } return 0; } diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index 24311d6f5c20..f1a3417d0b51 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, continue; } - if (walk->pmd_entry) { - err = walk->pmd_entry(pmd, addr, next, walk); + /* + * We don't take compound_lock() here but no race with splitting + * thp happens because: + * - if pmd_trans_huge_lock() returns 1, the relevant thp is + * not under splitting, which means there's no concurrent + * thp split, + * - if another thread runs into split_huge_page() just after + * we entered this if-block, the thread must wait for page + * table lock to be unlocked in __split_huge_page_splitting(), + * where the main part of thp split is not executed yet. + */ + if (walk->pmd_entry && walk->vma) { + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { + err = walk->pmd_entry(pmd, addr, next, walk); + spin_unlock(walk->ptl); + } if (skip_lower_level_walking(walk)) continue; if (err) -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f46.google.com (mail-qa0-f46.google.com [209.85.216.46]) by kanga.kvack.org (Postfix) with ESMTP id D04986B006E for ; Thu, 12 Jun 2014 17:48:36 -0400 (EDT) Received: by mail-qa0-f46.google.com with SMTP id i13so2376385qae.33 for ; Thu, 12 Jun 2014 14:48:36 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id s2si2597003qak.63.2014.06.12.14.48.36 for ; Thu, 12 Jun 2014 14:48:36 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 07/11] pagewalk: change type of arg of callbacks Date: Thu, 12 Jun 2014 17:48:07 -0400 Message-Id: <1402609691-13950-8-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Page table walker focuses on the leaf entries, and in some situation the caller is interested only in the size of entry (not in the details of pages pointed to by the entry.) Then it's helpful to share callback functions between different levels. For this purpose this patch changes args in callback functions and let them get the pointer of the entry in type of (void *). Signed-off-by: Naoya Horiguchi --- arch/openrisc/kernel/dma.c | 6 ++++-- arch/powerpc/mm/subpage-prot.c | 3 ++- fs/proc/task_mmu.c | 31 +++++++++++++++++++------------ include/linux/mm.h | 6 +++--- mm/madvise.c | 3 ++- mm/memcontrol.c | 12 ++++++++---- mm/mempolicy.c | 6 ++++-- 7 files changed, 42 insertions(+), 25 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/arch/openrisc/kernel/dma.c mmotm-2014-05-21-16-57/arch/openrisc/kernel/dma.c index 0b77ddb1ee07..a2983e8f6f04 100644 --- mmotm-2014-05-21-16-57.orig/arch/openrisc/kernel/dma.c +++ mmotm-2014-05-21-16-57/arch/openrisc/kernel/dma.c @@ -29,9 +29,10 @@ #include static int -page_set_nocache(pte_t *pte, unsigned long addr, +page_set_nocache(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk) { + pte_t *pte = entry; unsigned long cl; pte_val(*pte) |= _PAGE_CI; @@ -50,9 +51,10 @@ page_set_nocache(pte_t *pte, unsigned long addr, } static int -page_clear_nocache(pte_t *pte, unsigned long addr, +page_clear_nocache(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk) { + pte_t *pte = entry; pte_val(*pte) &= ~_PAGE_CI; /* diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c index d0d94ac606f3..d62e9adc93fb 100644 --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c @@ -131,9 +131,10 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len) } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, +static int subpage_walk_pmd_entry(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; spin_unlock(walk->ptl); split_huge_page_pmd(vma, addr, pmd); diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index 8211f6c8236d..a750d0842875 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -437,9 +437,10 @@ struct mem_size_stats { u64 pss; }; -static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, +static int smaps_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct mem_size_stats *mss = walk->private; struct vm_area_struct *vma = walk->vma; pgoff_t pgoff = linear_page_index(vma, addr); @@ -492,11 +493,11 @@ static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, return 0; } -static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, +static int smaps_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); + smaps_pte(entry, addr, end, walk); mss->anonymous_thp += HPAGE_PMD_SIZE; return 0; } @@ -720,9 +721,10 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, #endif } -static int clear_refs_pte(pte_t *pte, unsigned long addr, +static int clear_refs_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct clear_refs_private *cp = walk->private; struct vm_area_struct *vma = walk->vma; struct page *page; @@ -954,9 +956,10 @@ static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemap } #endif -static int pagemap_pte(pte_t *pte, unsigned long addr, unsigned long end, +static int pagemap_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct vm_area_struct *vma = walk->vma; struct pagemapread *pm = walk->private; pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); @@ -969,10 +972,11 @@ static int pagemap_pte(pte_t *pte, unsigned long addr, unsigned long end, return add_to_pagemap(addr, &pme, pm); } -static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, +static int pagemap_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { int err = 0; + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; struct pagemapread *pm = walk->private; pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); @@ -1009,9 +1013,10 @@ static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread * } /* This function walks within one hugetlb entry in the single call */ -static int pagemap_hugetlb(pte_t *pte, unsigned long addr, unsigned long end, +static int pagemap_hugetlb(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct pagemapread *pm = walk->private; struct vm_area_struct *vma = walk->vma; int err = 0; @@ -1243,9 +1248,10 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, return page; } -static int gather_pte_stats(pte_t *pte, unsigned long addr, +static int gather_pte_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct numa_maps *md = walk->private; struct page *page = can_gather_numa_stats(*pte, walk->vma, addr); @@ -1255,12 +1261,12 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr, return 0; } -static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, +static int gather_pmd_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct numa_maps *md = walk->private; struct vm_area_struct *vma = walk->vma; - pte_t huge_pte = *(pte_t *)pmd; + pte_t huge_pte = *(pte_t *)entry; struct page *page; page = can_gather_numa_stats(huge_pte, vma, addr); @@ -1270,9 +1276,10 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, return 0; } #ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetlb_stats(pte_t *pte, unsigned long addr, +static int gather_hugetlb_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct numa_maps *md; struct page *page; @@ -1292,7 +1299,7 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long addr, } #else -static int gather_hugetlb_stats(pte_t *pte, unsigned long addr, +static int gather_hugetlb_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index cbe17d9cbd7f..08c2a128dd5c 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1114,13 +1114,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * (see the comment on walk_page_range() for more details) */ struct mm_walk { - int (*pmd_entry)(pmd_t *pmd, unsigned long addr, + int (*pmd_entry)(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk); - int (*pte_entry)(pte_t *pte, unsigned long addr, + int (*pte_entry)(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_hole)(unsigned long addr, unsigned long next, struct mm_walk *walk); - int (*hugetlb_entry)(pte_t *pte, unsigned long addr, + int (*hugetlb_entry)(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*test_walk)(unsigned long addr, unsigned long next, struct mm_walk *walk); diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c index 06b390a6fbbd..df664fcbd443 100644 --- mmotm-2014-05-21-16-57.orig/mm/madvise.c +++ mmotm-2014-05-21-16-57/mm/madvise.c @@ -138,9 +138,10 @@ static long madvise_behavior(struct vm_area_struct *vma, /* * Assuming that page table walker holds page table lock. */ -static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, +static int swapin_walk_pte_entry(void *ent, unsigned long start, unsigned long end, struct mm_walk *walk) { + pte_t *pte = ent; pte_t ptent; pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); swp_entry_t entry; diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c index bb987cb9e043..7d62b6778a5b 100644 --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c +++ mmotm-2014-05-21-16-57/mm/memcontrol.c @@ -6709,19 +6709,21 @@ static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, } #endif -static int mem_cgroup_count_precharge_pte(pte_t *pte, +static int mem_cgroup_count_precharge_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; if (get_mctgt_type(walk->vma, addr, *pte, NULL)) mc.precharge++; /* increment precharge temporarily */ return 0; } -static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, +static int mem_cgroup_count_precharge_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) @@ -6875,11 +6877,12 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css, mem_cgroup_clear_mc(); } -static int mem_cgroup_move_charge_pte(pte_t *pte, +static int mem_cgroup_move_charge_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { int ret = 0; + pte_t *pte = entry; struct vm_area_struct *vma = walk->vma; union mc_target target; struct page *page; @@ -6936,10 +6939,11 @@ put: /* get_mctgt_type() gets the page */ return 0; } -static int mem_cgroup_move_charge_pmd(pmd_t *pmd, +static int mem_cgroup_move_charge_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; enum mc_target_type target_type; union mc_target target; diff --git mmotm-2014-05-21-16-57.orig/mm/mempolicy.c mmotm-2014-05-21-16-57/mm/mempolicy.c index b8267f753748..f74cacb36b95 100644 --- mmotm-2014-05-21-16-57.orig/mm/mempolicy.c +++ mmotm-2014-05-21-16-57/mm/mempolicy.c @@ -490,9 +490,10 @@ struct queue_pages { * Scan through pages checking if pages follow certain conditions, * and move them to the pagelist if they do. */ -static int queue_pages_pte(pte_t *pte, unsigned long addr, +static int queue_pages_pte(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk) { + pte_t *pte = entry; struct vm_area_struct *vma = walk->vma; struct page *page; struct queue_pages *qp = walk->private; @@ -519,10 +520,11 @@ static int queue_pages_pte(pte_t *pte, unsigned long addr, return 0; } -static int queue_pages_hugetlb(pte_t *pte, unsigned long addr, +static int queue_pages_hugetlb(void *ent, unsigned long addr, unsigned long next, struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE + pte_t *pte = ent; struct queue_pages *qp = walk->private; unsigned long flags = qp->flags; int nid; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f53.google.com (mail-qa0-f53.google.com [209.85.216.53]) by kanga.kvack.org (Postfix) with ESMTP id 0FDE66B006E for ; Thu, 12 Jun 2014 17:48:38 -0400 (EDT) Received: by mail-qa0-f53.google.com with SMTP id j15so2350867qaq.40 for ; Thu, 12 Jun 2014 14:48:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id h74si35628550qgd.79.2014.06.12.14.48.37 for ; Thu, 12 Jun 2014 14:48:37 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 08/11] pagewalk: update comment on walk_page_range() Date: Thu, 12 Jun 2014 17:48:08 -0400 Message-Id: <1402609691-13950-9-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Rewriting common code of page table walker has been done, so this patch updates the comment on walk_page_range() for the future development. Signed-off-by: Naoya Horiguchi --- mm/pagewalk.c | 55 +++++++++++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 28 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index b46c8882c643..626a80d4d9dd 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -253,39 +253,38 @@ static int __walk_page_range(unsigned long start, unsigned long end, * walk_page_range - walk page table with caller specific callbacks * * Recursively walk the page table tree of the process represented by - * @walk->mm within the virtual address range [@start, @end). In walking, - * we can call caller-specific callback functions against each entry. + * @walk->mm within the virtual address range [@start, @end). During walking, + * we can call caller-specific callback functions against each leaf entry. * * Before starting to walk page table, some callers want to check whether - * they really want to walk over the vma (for example by checking vm_flags.) - * walk_page_test() and @walk->test_walk() do that check. + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @walk->test_walk() are used for this + * purpose. * - * If any callback returns a non-zero value, the page table walk is aborted - * immediately and the return value is propagated back to the caller. - * Note that the meaning of the positive returned value can be defined - * by the caller for its own purpose. + * Currently we have three types of possible leaf enties, pte (for normal + * pages,) pmd (for thps,) and hugetlb. We handle these three with pte_entry(), + * pmd_entry(), and hugetlb_entry(), respectively. + * If you don't set any function to some of these callbacks, the associated + * entries/pages are ignored. + * The return values of these three callbacks are commonly defined like below: + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. + * We can set the same function to different callbacks, where @walk->size + * should be helpful to know the type of entry in callbacks. * - * If the caller defines multiple callbacks in different levels, the - * callbacks are called in depth-first manner. It could happen that - * multiple callbacks are called on a address. For example if some caller - * defines test_walk(), pmd_entry(), and pte_entry(), then callbacks are - * called in the order of test_walk(), pmd_entry(), and pte_entry(). - * If you don't want to go down to lower level at some point and move to - * the next entry in the same level, you set @walk->skip to 1. - * For example if you succeed to handle some pmd entry as trans_huge entry, - * you need not call walk_pte_range() any more, so set it to avoid that. - * We can't determine whether to go down to lower level with the return - * value of the callback, because the whole range of return values (0, >0, - * and <0) are used up for other meanings. + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @walk->private should be helpful. * - * Each callback can access to the vma over which it is doing page table - * walk right now via @walk->vma. @walk->vma is set to NULL in walking - * outside a vma. If you want to access to some caller-specific data from - * callbacks, @walk->private should be helpful. - * - * The callers should hold @walk->mm->mmap_sem. Note that the lower level - * iterators can take page table lock in lowest level iteration and/or - * in split_huge_page_pmd(). + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold + * @walk->mm->mmap_sem, because these function traverse vma list and/or + * access to vma's data. And page table lock is held during running + * pmd_entry(), pte_entry(), and hugetlb_entry(). */ int walk_page_range(unsigned long start, unsigned long end, struct mm_walk *walk) -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com [209.85.212.172]) by kanga.kvack.org (Postfix) with ESMTP id 023B56B006C for ; Thu, 12 Jun 2014 17:48:37 -0400 (EDT) Received: by mail-wi0-f172.google.com with SMTP id hi2so8593112wib.17 for ; Thu, 12 Jun 2014 14:48:37 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id ex4si3680185wjd.141.2014.06.12.14.48.35 for ; Thu, 12 Jun 2014 14:48:36 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 06/11] pagewalk: add size to struct mm_walk Date: Thu, 12 Jun 2014 17:48:06 -0400 Message-Id: <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org This variable is helpful if we try to share the callback function between multiple slots (for example between pte_entry() and pmd_entry()) as done in later patches. Signed-off-by: Naoya Horiguchi --- include/linux/mm.h | 2 ++ mm/pagewalk.c | 5 +++++ 2 files changed, 7 insertions(+) diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index 0a20674c84e2..cbe17d9cbd7f 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1108,6 +1108,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @vma: vma currently walked * @pmd: current pmd entry * @ptl: page table lock associated with current entry + * @size: size of current entry * @private: private data for callbacks' use * * (see the comment on walk_page_range() for more details) @@ -1127,6 +1128,7 @@ struct mm_walk { struct vm_area_struct *vma; pmd_t *pmd; spinlock_t *ptl; + long size; void *private; }; diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index 61d6bd9545d6..b46c8882c643 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -11,6 +11,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, pte_t *orig_pte; int err = 0; + walk->size = PAGE_SIZE; walk->pmd = pmd; orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); do { @@ -42,6 +43,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long next; int err = 0; + walk->size = PMD_SIZE; pmd = pmd_offset(pud, addr); do { again: @@ -97,6 +99,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long next; int err = 0; + walk->size = PUD_SIZE; pud = pud_offset(pgd, addr); do { next = pud_addr_end(addr, end); @@ -126,6 +129,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, unsigned long next; int err = 0; + walk->size = PGDIR_SIZE; pgd = pgd_offset(walk->mm, addr); do { next = pgd_addr_end(addr, end); @@ -167,6 +171,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, pte_t *pte; int err = 0; + walk->size = huge_page_size(h); do { next = hugetlb_entry_end(h, addr, end); pte = huge_pte_offset(walk->mm, addr & hmask); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f176.google.com (mail-we0-f176.google.com [74.125.82.176]) by kanga.kvack.org (Postfix) with ESMTP id 23BEE6B006C for ; Thu, 12 Jun 2014 17:48:42 -0400 (EDT) Received: by mail-we0-f176.google.com with SMTP id u56so1868912wes.7 for ; Thu, 12 Jun 2014 14:48:41 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id gr5si3692194wjc.118.2014.06.12.14.48.39 for ; Thu, 12 Jun 2014 14:48:40 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 09/11] fs/proc/task_mmu.c: refactor smaps Date: Thu, 12 Jun 2014 17:48:09 -0400 Message-Id: <1402609691-13950-10-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org smaps_(pte|pmd)() are almost the same, so let's clean them up to a single function. Signed-off-by: Naoya Horiguchi --- fs/proc/task_mmu.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index a750d0842875..1f2eab58ae14 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -437,7 +437,7 @@ struct mem_size_stats { u64 pss; }; -static int smaps_pte(void *entry, unsigned long addr, unsigned long end, +static int smaps_entry(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { pte_t *pte = entry; @@ -490,15 +490,10 @@ static int smaps_pte(void *entry, unsigned long addr, unsigned long end, mss->private_clean += ptent_size; mss->pss += (ptent_size << PSS_SHIFT); } - return 0; -} -static int smaps_pmd(void *entry, unsigned long addr, unsigned long end, - struct mm_walk *walk) -{ - struct mem_size_stats *mss = walk->private; - smaps_pte(entry, addr, end, walk); - mss->anonymous_thp += HPAGE_PMD_SIZE; + if (walk->size == PMD_SIZE) + mss->anonymous_thp += HPAGE_PMD_SIZE; + return 0; } @@ -563,8 +558,8 @@ static int show_smap(struct seq_file *m, void *v, int is_pid) struct vm_area_struct *vma = v; struct mem_size_stats mss; struct mm_walk smaps_walk = { - .pmd_entry = smaps_pmd, - .pte_entry = smaps_pte, + .pmd_entry = smaps_entry, + .pte_entry = smaps_entry, .mm = vma->vm_mm, .vma = vma, .private = &mss, -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id BD5246B0070 for ; Thu, 12 Jun 2014 17:48:43 -0400 (EDT) Received: by mail-wi0-f171.google.com with SMTP id n15so7744387wiw.16 for ; Thu, 12 Jun 2014 14:48:43 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id n9si5157907wiz.23.2014.06.12.14.48.41 for ; Thu, 12 Jun 2014 14:48:42 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 10/11] fs/proc/task_mmu.c: clean up gather_*_stats() Date: Thu, 12 Jun 2014 17:48:10 -0400 Message-Id: <1402609691-13950-11-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Most code of gather_(pte|pmd|hugetlb)_stats() are duplicate, so let's clean them up with a single function. vm_normal_page() doesn't calculate pgoff correctly for hugetlbfs, so this patch also fixes it. Signed-off-by: Naoya Horiguchi --- fs/proc/task_mmu.c | 58 ++++++------------------------------------------------ mm/memory.c | 5 ++--- 2 files changed, 8 insertions(+), 55 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index 1f2eab58ae14..27ad736c6b30 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -1243,63 +1243,17 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, return page; } -static int gather_pte_stats(void *entry, unsigned long addr, +static int gather_stats_entry(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { pte_t *pte = entry; struct numa_maps *md = walk->private; - struct page *page = can_gather_numa_stats(*pte, walk->vma, addr); - if (!page) - return 0; - gather_stats(page, md, pte_dirty(*pte), 1); - return 0; -} - -static int gather_pmd_stats(void *entry, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - struct numa_maps *md = walk->private; - struct vm_area_struct *vma = walk->vma; - pte_t huge_pte = *(pte_t *)entry; - struct page *page; - - page = can_gather_numa_stats(huge_pte, vma, addr); if (page) - gather_stats(page, md, pte_dirty(huge_pte), - HPAGE_PMD_SIZE/PAGE_SIZE); + gather_stats(page, md, pte_dirty(*pte), + walk->size >> PAGE_SHIFT); return 0; } -#ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetlb_stats(void *entry, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - pte_t *pte = entry; - struct numa_maps *md; - struct page *page; - - if (pte_none(*pte)) - return 0; - - if (!pte_present(*pte)) - return 0; - - page = pte_page(*pte); - if (!page) - return 0; - - md = walk->private; - gather_stats(page, md, pte_dirty(*pte), 1); - return 0; -} - -#else -static int gather_hugetlb_stats(void *entry, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - return 0; -} -#endif /* * Display pages allocated per node and memory policy via /proc. @@ -1324,9 +1278,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) /* Ensure we start with an empty set of numa_maps statistics. */ memset(md, 0, sizeof(*md)); - walk.hugetlb_entry = gather_hugetlb_stats; - walk.pmd_entry = gather_pmd_stats; - walk.pte_entry = gather_pte_stats; + walk.hugetlb_entry = gather_stats_entry; + walk.pmd_entry = gather_stats_entry; + walk.pte_entry = gather_stats_entry; walk.private = md; walk.mm = mm; walk.vma = vma; diff --git mmotm-2014-05-21-16-57.orig/mm/memory.c mmotm-2014-05-21-16-57/mm/memory.c index fd16b767dd68..7389dd04370f 100644 --- mmotm-2014-05-21-16-57.orig/mm/memory.c +++ mmotm-2014-05-21-16-57/mm/memory.c @@ -768,9 +768,8 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, return NULL; goto out; } else { - unsigned long off; - off = (addr - vma->vm_start) >> PAGE_SHIFT; - if (pfn == vma->vm_pgoff + off) + unsigned long off = linear_page_index(vma, addr); + if (pfn == off) return NULL; if (!is_cow_mapping(vma->vm_flags)) return NULL; -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by kanga.kvack.org (Postfix) with ESMTP id 2DF0C6B0074 for ; Thu, 12 Jun 2014 17:48:48 -0400 (EDT) Received: by mail-wi0-f173.google.com with SMTP id cc10so6267919wib.0 for ; Thu, 12 Jun 2014 14:48:47 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id z13si3730989wjr.48.2014.06.12.14.48.45 for ; Thu, 12 Jun 2014 14:48:46 -0700 (PDT) From: Naoya Horiguchi Subject: [PATCH -mm v2 11/11] mincore: apply page table walker on do_mincore() Date: Thu, 12 Jun 2014 17:48:11 -0400 Message-Id: <1402609691-13950-12-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org This patch makes do_mincore() use walk_page_vma(), which reduces many lines of code by using common page table walk code. ChangeLog v2: - change type of args of callbacks to void * - move definition of mincore_walk to the start of the function to fix compiler warning Signed-off-by: Naoya Horiguchi --- mm/huge_memory.c | 20 ------ mm/mincore.c | 195 +++++++++++++++++++------------------------------------ 2 files changed, 67 insertions(+), 148 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/huge_memory.c mmotm-2014-05-21-16-57/mm/huge_memory.c index 6fd0668d4e1d..2671a9621d0e 100644 --- mmotm-2014-05-21-16-57.orig/mm/huge_memory.c +++ mmotm-2014-05-21-16-57/mm/huge_memory.c @@ -1379,26 +1379,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } -int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - spinlock_t *ptl; - int ret = 0; - - if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - /* - * All logical pages in the range are present - * if backed by a huge page. - */ - spin_unlock(ptl); - memset(vec, 1, (end - addr) >> PAGE_SHIFT); - ret = 1; - } - - return ret; -} - int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, unsigned long old_addr, unsigned long new_addr, unsigned long old_end, diff --git mmotm-2014-05-21-16-57.orig/mm/mincore.c mmotm-2014-05-21-16-57/mm/mincore.c index 725c80961048..d53b07548ce1 100644 --- mmotm-2014-05-21-16-57.orig/mm/mincore.c +++ mmotm-2014-05-21-16-57/mm/mincore.c @@ -19,38 +19,28 @@ #include #include -static void mincore_hugetlb_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_hugetlb(void *entry, unsigned long addr, + unsigned long end, struct mm_walk *walk) { + int err = 0; #ifdef CONFIG_HUGETLB_PAGE - struct hstate *h; + pte_t *pte = entry; + unsigned char present; + unsigned char *vec = walk->private; - h = hstate_vma(vma); - while (1) { - unsigned char present; - pte_t *ptep; - /* - * Huge pages are always in RAM for now, but - * theoretically it needs to be checked. - */ - ptep = huge_pte_offset(current->mm, - addr & huge_page_mask(h)); - present = ptep && !huge_pte_none(huge_ptep_get(ptep)); - while (1) { - *vec = present; - vec++; - addr += PAGE_SIZE; - if (addr == end) - return; - /* check hugepage border */ - if (!(addr & ~huge_page_mask(h))) - break; - } - } + /* + * Hugepages under user process are always in RAM and never + * swapped out, but theoretically it needs to be checked. + */ + present = pte && !huge_pte_none(huge_ptep_get(pte)); + for (; addr != end; vec++, addr += PAGE_SIZE) + *vec = present; + cond_resched(); + walk->private += (end - addr) >> PAGE_SHIFT; #else BUG(); #endif + return err; } /* @@ -94,10 +84,11 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) return present; } -static void mincore_unmapped_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_hole(unsigned long addr, unsigned long end, + struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; + unsigned char *vec = walk->private; unsigned long nr = (end - addr) >> PAGE_SHIFT; int i; @@ -111,110 +102,50 @@ static void mincore_unmapped_range(struct vm_area_struct *vma, for (i = 0; i < nr; i++) vec[i] = 0; } + walk->private += nr; + return 0; } -static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_pte(void *entry, unsigned long addr, unsigned long end, + struct mm_walk *walk) { - unsigned long next; - spinlock_t *ptl; - pte_t *ptep; - - ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - do { - pte_t pte = *ptep; - pgoff_t pgoff; - - next = addr + PAGE_SIZE; - if (pte_none(pte)) - mincore_unmapped_range(vma, addr, next, vec); - else if (pte_present(pte)) + pte_t *pte = entry; + struct vm_area_struct *vma = walk->vma; + unsigned char *vec = walk->private; + pgoff_t pgoff; + + if (pte_present(*pte)) + *vec = 1; + else if (pte_file(*pte)) { + pgoff = pte_to_pgoff(*pte); + *vec = mincore_page(vma->vm_file->f_mapping, pgoff); + } else { /* pte is a swap entry */ + swp_entry_t entry = pte_to_swp_entry(*pte); + + if (is_migration_entry(entry)) { + /* migration entries are always uptodate */ *vec = 1; - else if (pte_file(pte)) { - pgoff = pte_to_pgoff(pte); - *vec = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { /* pte is a swap entry */ - swp_entry_t entry = pte_to_swp_entry(pte); - - if (is_migration_entry(entry)) { - /* migration entries are always uptodate */ - *vec = 1; - } else { + } else { #ifdef CONFIG_SWAP - pgoff = entry.val; - *vec = mincore_page(swap_address_space(entry), - pgoff); + pgoff = entry.val; + *vec = mincore_page(swap_address_space(entry), + pgoff); #else - WARN_ON(1); - *vec = 1; + WARN_ON(1); + *vec = 1; #endif - } - } - vec++; - } while (ptep++, addr = next, addr != end); - pte_unmap_unlock(ptep - 1, ptl); -} - -static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pmd_t *pmd; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) { - if (mincore_huge_pmd(vma, pmd, addr, next, vec)) { - vec += (next - addr) >> PAGE_SHIFT; - continue; - } - /* fall through */ } - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pte_range(vma, pmd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pmd++, addr = next, addr != end); -} - -static void mincore_pud_range(struct vm_area_struct *vma, pgd_t *pgd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pud_t *pud; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pmd_range(vma, pud, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pud++, addr = next, addr != end); + } + walk->private++; + return 0; } -static void mincore_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_pmd(void *entry, unsigned long addr, unsigned long end, + struct mm_walk *walk) { - unsigned long next; - pgd_t *pgd; - - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pud_range(vma, pgd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pgd++, addr = next, addr != end); + memset(walk->private, 1, (end - addr) >> PAGE_SHIFT); + walk->private += (end - addr) >> PAGE_SHIFT; + return 0; } /* @@ -226,19 +157,27 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v { struct vm_area_struct *vma; unsigned long end; + int err; + struct mm_walk mincore_walk = { + .pmd_entry = mincore_pmd, + .pte_entry = mincore_pte, + .pte_hole = mincore_hole, + .hugetlb_entry = mincore_hugetlb, + .private = vec, + }; vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - + mincore_walk.mm = vma->vm_mm; + mincore_walk.vma = vma; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - if (is_vm_hugetlb_page(vma)) - mincore_hugetlb_page_range(vma, addr, end, vec); + err = walk_page_vma(vma, &mincore_walk); + if (err < 0) + return err; else - mincore_page_range(vma, addr, end, vec); - - return (end - addr) >> PAGE_SHIFT; + return (end - addr) >> PAGE_SHIFT; } /* -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id 7EA706B003A for ; Thu, 12 Jun 2014 17:56:19 -0400 (EDT) Received: by mail-pa0-f53.google.com with SMTP id kx10so1386092pab.40 for ; Thu, 12 Jun 2014 14:56:19 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTP id kc2si42869613pbc.148.2014.06.12.14.56.18 for ; Thu, 12 Jun 2014 14:56:18 -0700 (PDT) Date: Thu, 12 Jun 2014 14:56:17 -0700 From: Andrew Morton Subject: Re: [PATCH -mm v2 00/11] pagewalk: standardize current users, move pmd locking, apply to mincore Message-Id: <20140612145617.5debf04bd1a2978be4a1fb88@linux-foundation.org> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi Cc: linux-mm@kvack.org, Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On Thu, 12 Jun 2014 17:48:00 -0400 Naoya Horiguchi wrote: > This is ver.2 of page table walker patchset. > > I move forward on this cleanup work, and added some improvement from the > previous version. Major changes are: > - removed walk->skip which becomes removable due to refactoring existing > users > - commonalized the argments of entry handlers (pte|pmd|hugetlb)_entry() > which allows us to use the same function as multiple handlers. > Are you sure you didn't miss anything? mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v2.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3-fix.patch pagewalk-update-page-table-walker-core.patch pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range.patch pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range-fix.patch pagewalk-update-page-table-walker-core-fix.patch pagewalk-add-walk_page_vma.patch smaps-redefine-callback-functions-for-page-table-walker.patch clear_refs-redefine-callback-functions-for-page-table-walker.patch pagemap-redefine-callback-functions-for-page-table-walker.patch pagemap-redefine-callback-functions-for-page-table-walker-fix.patch numa_maps-redefine-callback-functions-for-page-table-walker.patch memcg-redefine-callback-functions-for-page-table-walker.patch arch-powerpc-mm-subpage-protc-use-walk_page_vma-instead-of-walk_page_range.patch pagewalk-remove-argument-hmask-from-hugetlb_entry.patch pagewalk-remove-argument-hmask-from-hugetlb_entry-fix.patch pagewalk-remove-argument-hmask-from-hugetlb_entry-fix-fix.patch mempolicy-apply-page-table-walker-on-queue_pages_range.patch mm-pagewalkc-move-pte-null-check.patch mm-prom-pid-clear_refs-avoid-split_huge_page.patch # mm-pagewalk-remove-pgd_entry-and-pud_entry.patch mm-pagewalk-replace-mm_walk-skip-with-more-general-mm_walk-control.patch madvise-cleanup-swapin_walk_pmd_entry.patch memcg-separate-mem_cgroup_move_charge_pte_range.patch memcg-separate-mem_cgroup_move_charge_pte_range-checkpatch-fixes.patch arch-powerpc-mm-subpage-protc-cleanup-subpage_walk_pmd_entry.patch mm-pagewalk-move-pmd_trans_huge_lock-from-callbacks-to-common-code.patch mm-pagewalk-move-pmd_trans_huge_lock-from-callbacks-to-common-code-checkpatch-fixes.patch mincore-apply-page-table-walker-on-do_mincore.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff.patch apepars to have disappeared, I didn't check the rest. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f41.google.com (mail-pa0-f41.google.com [209.85.220.41]) by kanga.kvack.org (Postfix) with ESMTP id 908A26B0031 for ; Thu, 12 Jun 2014 18:07:42 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id kq14so1397719pab.14 for ; Thu, 12 Jun 2014 15:07:42 -0700 (PDT) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id he6si2496691pac.20.2014.06.12.15.07.41 for ; Thu, 12 Jun 2014 15:07:41 -0700 (PDT) Message-ID: <539A248F.2090306@intel.com> Date: Thu, 12 Jun 2014 15:07:11 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH -mm v2 06/11] pagewalk: add size to struct mm_walk References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi , linux-mm@kvack.org Cc: Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On 06/12/2014 02:48 PM, Naoya Horiguchi wrote: > This variable is helpful if we try to share the callback function between > multiple slots (for example between pte_entry() and pmd_entry()) as done > in later patches. smaps_pte() already does this: static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, struct mm_walk *walk) ... unsigned long ptent_size = end - addr; Other than the hugetlb handler, can't we always imply the size from end-addr? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by kanga.kvack.org (Postfix) with ESMTP id 3F1846B0035 for ; Thu, 12 Jun 2014 18:21:19 -0400 (EDT) Received: by mail-wi0-f170.google.com with SMTP id cc10so1449919wib.3 for ; Thu, 12 Jun 2014 15:21:18 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id cz6si3828162wjc.27.2014.06.12.15.21.16 for ; Thu, 12 Jun 2014 15:21:17 -0700 (PDT) Message-ID: <539a27dd.a6b2c20a.123e.ffff8f52SMTPIN_ADDED_BROKEN@mx.google.com> From: Naoya Horiguchi Subject: Re: [PATCH -mm v2 00/11] pagewalk: standardize current users, move pmd locking, apply to mincore Date: Thu, 12 Jun 2014 18:21:05 -0400 In-Reply-To: <20140612145617.5debf04bd1a2978be4a1fb88@linux-foundation.org> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <20140612145617.5debf04bd1a2978be4a1fb88@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On Thu, Jun 12, 2014 at 02:56:17PM -0700, Andrew Morton wrote: > On Thu, 12 Jun 2014 17:48:00 -0400 Naoya Horiguchi wrote: > > > This is ver.2 of page table walker patchset. > > > > I move forward on this cleanup work, and added some improvement from the > > previous version. Major changes are: > > - removed walk->skip which becomes removable due to refactoring existing > > users > > - commonalized the argments of entry handlers (pte|pmd|hugetlb)_entry() > > which allows us to use the same function as multiple handlers. > > > > Are you sure you didn't miss anything? > > mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff.patch > mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v2.patch > mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3.patch > mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3-fix.patch > pagewalk-update-page-table-walker-core.patch > pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range.patch > pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range-fix.patch > pagewalk-update-page-table-walker-core-fix.patch > pagewalk-add-walk_page_vma.patch > smaps-redefine-callback-functions-for-page-table-walker.patch > clear_refs-redefine-callback-functions-for-page-table-walker.patch > pagemap-redefine-callback-functions-for-page-table-walker.patch > pagemap-redefine-callback-functions-for-page-table-walker-fix.patch > numa_maps-redefine-callback-functions-for-page-table-walker.patch > memcg-redefine-callback-functions-for-page-table-walker.patch > arch-powerpc-mm-subpage-protc-use-walk_page_vma-instead-of-walk_page_range.patch > pagewalk-remove-argument-hmask-from-hugetlb_entry.patch > pagewalk-remove-argument-hmask-from-hugetlb_entry-fix.patch > pagewalk-remove-argument-hmask-from-hugetlb_entry-fix-fix.patch > mempolicy-apply-page-table-walker-on-queue_pages_range.patch > mm-pagewalkc-move-pte-null-check.patch This patchset is based on mmotm-2014-05-21-16, so supposed to be applied on top of the above patches. > mm-prom-pid-clear_refs-avoid-split_huge_page.patch I didn't assume this patch as a base, so if my patchset conflicts with it, please let me know. > # > mm-pagewalk-remove-pgd_entry-and-pud_entry.patch > mm-pagewalk-replace-mm_walk-skip-with-more-general-mm_walk-control.patch > madvise-cleanup-swapin_walk_pmd_entry.patch > memcg-separate-mem_cgroup_move_charge_pte_range.patch > memcg-separate-mem_cgroup_move_charge_pte_range-checkpatch-fixes.patch > arch-powerpc-mm-subpage-protc-cleanup-subpage_walk_pmd_entry.patch > mm-pagewalk-move-pmd_trans_huge_lock-from-callbacks-to-common-code.patch > mm-pagewalk-move-pmd_trans_huge_lock-from-callbacks-to-common-code-checkpatch-fixes.patch > mincore-apply-page-table-walker-on-do_mincore.patch Yes, the current post is the revision of these patches. And to be precise, this patchset also depends on a patch which I posted today for the fix discussed in the thread about "kmemleak: Unable to handle kernel paging request." So it might be a source of conflict. Sorry if it's confusing. Thanks, Naoya Horiguchi > > mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff.patch > apepars to have disappeared, I didn't check the rest. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by kanga.kvack.org (Postfix) with ESMTP id 4B3056B0039 for ; Thu, 12 Jun 2014 18:36:52 -0400 (EDT) Received: by mail-we0-f182.google.com with SMTP id q59so1993643wes.13 for ; Thu, 12 Jun 2014 15:36:51 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id f13si3809642wjz.137.2014.06.12.15.36.49 for ; Thu, 12 Jun 2014 15:36:50 -0700 (PDT) Message-ID: <539a2b82.4d55c20a.4b2c.ffff905bSMTPIN_ADDED_BROKEN@mx.google.com> From: Naoya Horiguchi Subject: Re: [PATCH -mm v2 06/11] pagewalk: add size to struct mm_walk Date: Thu, 12 Jun 2014 18:36:40 -0400 In-Reply-To: <539A248F.2090306@intel.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> <539A248F.2090306@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: linux-mm@kvack.org, Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Hello Dave, On Thu, Jun 12, 2014 at 03:07:11PM -0700, Dave Hansen wrote: > On 06/12/2014 02:48 PM, Naoya Horiguchi wrote: > > This variable is helpful if we try to share the callback function between > > multiple slots (for example between pte_entry() and pmd_entry()) as done > > in later patches. > > smaps_pte() already does this: > > static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, > struct mm_walk *walk) > ... > unsigned long ptent_size = end - addr; > > Other than the hugetlb handler, can't we always imply the size from > end-addr? Good point, thanks. I didn't care about this variable. Currently we call this walk via walk_page_vma() so addr and end is always between [vma->vm_start, vma->vm_end]. If a vma is not aligned to pmd-boundary, this size might have incorrect value. But using end-addr seems to cause no practical problem because in such case first or final pmd never have a thp. I'm not sure every caller (especially callers of walk_page_range()) assumes addr/end is page aligned, but walk->size approach looks safer to me. Thanks, Naoya Horiguchi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f47.google.com (mail-pb0-f47.google.com [209.85.160.47]) by kanga.kvack.org (Postfix) with ESMTP id A06A96B0031 for ; Sun, 15 Jun 2014 16:25:52 -0400 (EDT) Received: by mail-pb0-f47.google.com with SMTP id up15so745671pbc.6 for ; Sun, 15 Jun 2014 13:25:52 -0700 (PDT) Received: from mail-pa0-x233.google.com (mail-pa0-x233.google.com [2607:f8b0:400e:c03::233]) by mx.google.com with ESMTPS id ja1si8643085pbb.218.2014.06.15.13.25.51 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sun, 15 Jun 2014 13:25:51 -0700 (PDT) Received: by mail-pa0-f51.google.com with SMTP id hz1so568828pad.38 for ; Sun, 15 Jun 2014 13:25:51 -0700 (PDT) Date: Sun, 15 Jun 2014 13:24:30 -0700 (PDT) From: Hugh Dickins Subject: Re: [PATCH -mm v2 02/11] madvise: cleanup swapin_walk_pmd_entry() In-Reply-To: <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> Message-ID: References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On Thu, 12 Jun 2014, Naoya Horiguchi wrote: > With the recent update on page table walker, we can use common code for > the walking more. Unlike many other users, this swapin_walk expects to > handle swap entries. As a result we should be careful about ptl locking. > Swapin operation, read_swap_cache_async(), could cause page reclaim, so > we can't keep holding ptl throughout this pte loop. > In order to properly handle ptl in pte_entry(), this patch adds two new > members on struct mm_walk. > > This cleanup is necessary to get to the final form of page table walker, > where we should do all caller's specific work on leaf entries (IOW, all > pmd_entry() should be used for trans_pmd.) > > Signed-off-by: Naoya Horiguchi > Cc: Hugh Dickins Sorry, I believe this (and probably several other of your conversions) is badly flawed. You have a pattern of doing pte_offset_map_lock() inside the page walker, then dropping and regetting map and lock inside your pte handler. But on, say, x86_32 with CONFIG_HIGHMEM, CONFIG_SMP and CONFIG_PREEMPT, you may be preempted then run on a different cpu while atomic kmap and lock are dropped: so that the pte pointer then used on return to walk_pte_range() will no longer correspond to the right mapping. Presumably that can be fixed by keeping the pte pointer in the mm_walk structure; but I'm not at all sure that's the right thing to do. I am not nearly so keen as you to reduce all these to per-pte callouts, which seem inefficient to me. It can be argued both ways on the less important functions (like this madvise one); but I hope you don't try to make this kind of conversion to fast paths like those in memory.c. Hugh > --- > include/linux/mm.h | 4 ++++ > mm/madvise.c | 54 +++++++++++++++++++++++------------------------------- > mm/pagewalk.c | 11 +++++------ > 3 files changed, 32 insertions(+), 37 deletions(-) > > diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h > index b4aa6579f2b1..aa832161a1ff 100644 > --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h > +++ mmotm-2014-05-21-16-57/include/linux/mm.h > @@ -1108,6 +1108,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, > * @vma: vma currently walked > * @skip: internal control flag which is set when we skip the lower > * level entries. > + * @pmd: current pmd entry > + * @ptl: page table lock associated with current entry > * @private: private data for callbacks' use > * > * (see the comment on walk_page_range() for more details) > @@ -1126,6 +1128,8 @@ struct mm_walk { > struct mm_struct *mm; > struct vm_area_struct *vma; > int skip; > + pmd_t *pmd; > + spinlock_t *ptl; > void *private; > }; > > diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c > index a402f8fdc68e..06b390a6fbbd 100644 > --- mmotm-2014-05-21-16-57.orig/mm/madvise.c > +++ mmotm-2014-05-21-16-57/mm/madvise.c > @@ -135,38 +135,31 @@ static long madvise_behavior(struct vm_area_struct *vma, > } > > #ifdef CONFIG_SWAP > -static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, > +/* > + * Assuming that page table walker holds page table lock. > + */ > +static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, > unsigned long end, struct mm_walk *walk) > { > - pte_t *orig_pte; > - struct vm_area_struct *vma = walk->private; > - unsigned long index; > - > - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) > - return 0; > - > - for (index = start; index != end; index += PAGE_SIZE) { > - pte_t pte; > - swp_entry_t entry; > - struct page *page; > - spinlock_t *ptl; > - > - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); > - pte_unmap_unlock(orig_pte, ptl); > - > - if (pte_present(pte) || pte_none(pte) || pte_file(pte)) > - continue; > - entry = pte_to_swp_entry(pte); > - if (unlikely(non_swap_entry(entry))) > - continue; > - > - page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > - vma, index); > - if (page) > - page_cache_release(page); > - } > + pte_t ptent; > + pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); > + swp_entry_t entry; > + struct page *page; > > + ptent = *pte; > + pte_unmap_unlock(orig_pte, walk->ptl); > + if (pte_present(ptent) || pte_none(ptent) || pte_file(ptent)) > + goto lock; > + entry = pte_to_swp_entry(ptent); > + if (unlikely(non_swap_entry(entry))) > + goto lock; > + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > + walk->vma, start); > + if (page) > + page_cache_release(page); > +lock: > + pte_offset_map(walk->pmd, start & PMD_MASK); > + spin_lock(walk->ptl); > return 0; > } > > @@ -175,8 +168,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma, > { > struct mm_walk walk = { > .mm = vma->vm_mm, > - .pmd_entry = swapin_walk_pmd_entry, > - .private = vma, > + .pte_entry = swapin_walk_pte_entry, > }; > > walk_page_range(start, end, &walk); > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > index e734f63276c2..24311d6f5c20 100644 > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > @@ -27,10 +27,10 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > struct mm_struct *mm = walk->mm; > pte_t *pte; > pte_t *orig_pte; > - spinlock_t *ptl; > int err = 0; > > - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > + walk->pmd = pmd; > + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); > do { > if (pte_none(*pte)) { > if (walk->pte_hole) > @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > if (err) > break; > } while (pte++, addr += PAGE_SIZE, addr < end); > - pte_unmap_unlock(orig_pte, ptl); > + pte_unmap_unlock(orig_pte, walk->ptl); > cond_resched(); > return addr == end ? 0 : err; > } > @@ -172,7 +172,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > unsigned long hmask = huge_page_mask(h); > pte_t *pte; > int err = 0; > - spinlock_t *ptl; > > do { > next = hugetlb_entry_end(h, addr, end); > @@ -186,14 +185,14 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > break; > continue; > } > - ptl = huge_pte_lock(h, mm, pte); > + walk->ptl = huge_pte_lock(h, mm, pte); > /* > * Callers should have their own way to handle swap entries > * in walk->hugetlb_entry(). > */ > if (walk->hugetlb_entry) > err = walk->hugetlb_entry(pte, addr, next, walk); > - spin_unlock(ptl); > + spin_unlock(walk->ptl); > if (err) > break; > } while (addr = next, addr != end); > -- > 1.9.3 > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id D0CAC6B0031 for ; Mon, 16 Jun 2014 12:00:01 -0400 (EDT) Received: by mail-wi0-f177.google.com with SMTP id r20so4349324wiv.10 for ; Mon, 16 Jun 2014 09:00:01 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id la3si19930002wjb.23.2014.06.16.08.59.59 for ; Mon, 16 Jun 2014 09:00:00 -0700 (PDT) Date: Mon, 16 Jun 2014 11:59:50 -0400 From: Naoya Horiguchi Subject: Re: [PATCH -mm v2 02/11] madvise: cleanup swapin_walk_pmd_entry() Message-ID: <20140616155950.GA13264@nhori.bos.redhat.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On Sun, Jun 15, 2014 at 01:24:30PM -0700, Hugh Dickins wrote: > On Thu, 12 Jun 2014, Naoya Horiguchi wrote: > > > With the recent update on page table walker, we can use common code for > > the walking more. Unlike many other users, this swapin_walk expects to > > handle swap entries. As a result we should be careful about ptl locking. > > Swapin operation, read_swap_cache_async(), could cause page reclaim, so > > we can't keep holding ptl throughout this pte loop. > > In order to properly handle ptl in pte_entry(), this patch adds two new > > members on struct mm_walk. > > > > This cleanup is necessary to get to the final form of page table walker, > > where we should do all caller's specific work on leaf entries (IOW, all > > pmd_entry() should be used for trans_pmd.) > > > > Signed-off-by: Naoya Horiguchi > > Cc: Hugh Dickins > > Sorry, I believe this (and probably several other of your conversions) > is badly flawed. > > You have a pattern of doing pte_offset_map_lock() inside the page walker, > then dropping and regetting map and lock inside your pte handler. > > But on, say, x86_32 with CONFIG_HIGHMEM, CONFIG_SMP and CONFIG_PREEMPT, > you may be preempted then run on a different cpu while atomic kmap and > lock are dropped: so that the pte pointer then used on return to > walk_pte_range() will no longer correspond to the right mapping. Sorry, I didn't handle it correctly. > Presumably that can be fixed by keeping the pte pointer in the mm_walk > structure; but I'm not at all sure that's the right thing to do. orig_pte should be updated if we call pte_offset_map_lock() or pte_offset_map() inside the callback. So moving orig_pte into mm_walk is what I think I'll do in the next post. > I am not nearly so keen as you to reduce all these to per-pte callouts, > which seem inefficient to me. Right, we can't do inlining for callbacks, so calling callbacks more is slower than open code. To make it better, code generator which creates open page walk code for each users at build time might be one option. Standardization done in patchset is the first step for doing it. > It can be argued both ways on the less > important functions (like this madvise one); but I hope you don't try > to make this kind of conversion to fast paths like those in memory.c. OK. I never touch fast paths until performance concern is solved. Thanks, Naoya Horiguchi > Hugh > > > --- > > include/linux/mm.h | 4 ++++ > > mm/madvise.c | 54 +++++++++++++++++++++++------------------------------- > > mm/pagewalk.c | 11 +++++------ > > 3 files changed, 32 insertions(+), 37 deletions(-) > > > > diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h > > index b4aa6579f2b1..aa832161a1ff 100644 > > --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h > > +++ mmotm-2014-05-21-16-57/include/linux/mm.h > > @@ -1108,6 +1108,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, > > * @vma: vma currently walked > > * @skip: internal control flag which is set when we skip the lower > > * level entries. > > + * @pmd: current pmd entry > > + * @ptl: page table lock associated with current entry > > * @private: private data for callbacks' use > > * > > * (see the comment on walk_page_range() for more details) > > @@ -1126,6 +1128,8 @@ struct mm_walk { > > struct mm_struct *mm; > > struct vm_area_struct *vma; > > int skip; > > + pmd_t *pmd; > > + spinlock_t *ptl; > > void *private; > > }; > > > > diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c > > index a402f8fdc68e..06b390a6fbbd 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/madvise.c > > +++ mmotm-2014-05-21-16-57/mm/madvise.c > > @@ -135,38 +135,31 @@ static long madvise_behavior(struct vm_area_struct *vma, > > } > > > > #ifdef CONFIG_SWAP > > -static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, > > +/* > > + * Assuming that page table walker holds page table lock. > > + */ > > +static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, > > unsigned long end, struct mm_walk *walk) > > { > > - pte_t *orig_pte; > > - struct vm_area_struct *vma = walk->private; > > - unsigned long index; > > - > > - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) > > - return 0; > > - > > - for (index = start; index != end; index += PAGE_SIZE) { > > - pte_t pte; > > - swp_entry_t entry; > > - struct page *page; > > - spinlock_t *ptl; > > - > > - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > > - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); > > - pte_unmap_unlock(orig_pte, ptl); > > - > > - if (pte_present(pte) || pte_none(pte) || pte_file(pte)) > > - continue; > > - entry = pte_to_swp_entry(pte); > > - if (unlikely(non_swap_entry(entry))) > > - continue; > > - > > - page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > > - vma, index); > > - if (page) > > - page_cache_release(page); > > - } > > + pte_t ptent; > > + pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); > > + swp_entry_t entry; > > + struct page *page; > > > > + ptent = *pte; > > + pte_unmap_unlock(orig_pte, walk->ptl); > > + if (pte_present(ptent) || pte_none(ptent) || pte_file(ptent)) > > + goto lock; > > + entry = pte_to_swp_entry(ptent); > > + if (unlikely(non_swap_entry(entry))) > > + goto lock; > > + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > > + walk->vma, start); > > + if (page) > > + page_cache_release(page); > > +lock: > > + pte_offset_map(walk->pmd, start & PMD_MASK); > > + spin_lock(walk->ptl); > > return 0; > > } > > > > @@ -175,8 +168,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma, > > { > > struct mm_walk walk = { > > .mm = vma->vm_mm, > > - .pmd_entry = swapin_walk_pmd_entry, > > - .private = vma, > > + .pte_entry = swapin_walk_pte_entry, > > }; > > > > walk_page_range(start, end, &walk); > > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > > index e734f63276c2..24311d6f5c20 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > > @@ -27,10 +27,10 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > > struct mm_struct *mm = walk->mm; > > pte_t *pte; > > pte_t *orig_pte; > > - spinlock_t *ptl; > > int err = 0; > > > > - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > > + walk->pmd = pmd; > > + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); > > do { > > if (pte_none(*pte)) { > > if (walk->pte_hole) > > @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > > if (err) > > break; > > } while (pte++, addr += PAGE_SIZE, addr < end); > > - pte_unmap_unlock(orig_pte, ptl); > > + pte_unmap_unlock(orig_pte, walk->ptl); > > cond_resched(); > > return addr == end ? 0 : err; > > } > > @@ -172,7 +172,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > unsigned long hmask = huge_page_mask(h); > > pte_t *pte; > > int err = 0; > > - spinlock_t *ptl; > > > > do { > > next = hugetlb_entry_end(h, addr, end); > > @@ -186,14 +185,14 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > break; > > continue; > > } > > - ptl = huge_pte_lock(h, mm, pte); > > + walk->ptl = huge_pte_lock(h, mm, pte); > > /* > > * Callers should have their own way to handle swap entries > > * in walk->hugetlb_entry(). > > */ > > if (walk->hugetlb_entry) > > err = walk->hugetlb_entry(pte, addr, next, walk); > > - spin_unlock(ptl); > > + spin_unlock(walk->ptl); > > if (err) > > break; > > } while (addr = next, addr != end); > > -- > > 1.9.3 > > > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f171.google.com (mail-qc0-f171.google.com [209.85.216.171]) by kanga.kvack.org (Postfix) with ESMTP id B69D66B0031 for ; Tue, 17 Jun 2014 10:28:04 -0400 (EDT) Received: by mail-qc0-f171.google.com with SMTP id w7so10257578qcr.30 for ; Tue, 17 Jun 2014 07:28:04 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id v7si17085743qad.84.2014.06.17.07.28.03 for ; Tue, 17 Jun 2014 07:28:04 -0700 (PDT) Message-ID: <53A0506C.6040609@redhat.com> Date: Tue, 17 Jun 2014 16:27:56 +0200 From: Jerome Marchand MIME-Version: 1.0 Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi , linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: > Now all of current users of page table walker are canonicalized, i.e. > pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. > So we can factorize common code more. > This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. > > ChangeLog v2: > - add null check walk->vma in walk_pmd_range() An older version of this patch already made it to linux-next (commit b0e08c5) and I've actually hit the NULL pointer dereference. Moreover, that patch (or maybe another recent pagewalk patch) breaks /proc//smaps. All fields that should have been filled by smaps_pte() are almost always zero (and when it isn't, it's always a multiple of 2MB). It seems to me that the page walk never goes below pmd level. Jerome > - move comment update into a separate patch > > Signed-off-by: Naoya Horiguchi > --- > arch/powerpc/mm/subpage-prot.c | 2 ++ > fs/proc/task_mmu.c | 66 ++++++++++++++---------------------------- > mm/memcontrol.c | 55 ++++++++++------------------------- > mm/pagewalk.c | 18 ++++++++++-- > 4 files changed, 55 insertions(+), 86 deletions(-) > > diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > index fa9fb5b4c66c..d0d94ac606f3 100644 > --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c > +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > @@ -135,7 +135,9 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, > unsigned long end, struct mm_walk *walk) > { > struct vm_area_struct *vma = walk->vma; > + spin_unlock(walk->ptl); > split_huge_page_pmd(vma, addr, pmd); > + spin_lock(walk->ptl); > return 0; > } > > diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > index fa6d6a4e85b3..059206ea3c6b 100644 > --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c > +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > @@ -496,15 +496,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > struct mm_walk *walk) > { > struct mem_size_stats *mss = walk->private; > - spinlock_t *ptl; > - > - if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) { > - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > - spin_unlock(ptl); > - mss->anonymous_thp += HPAGE_PMD_SIZE; > - /* don't call smaps_pte() */ > - walk->skip = 1; > - } > + smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > + mss->anonymous_thp += HPAGE_PMD_SIZE; > return 0; > } > > @@ -983,31 +976,21 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > struct vm_area_struct *vma = walk->vma; > struct pagemapread *pm = walk->private; > pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); > - spinlock_t *ptl; > - > - if (!vma) > - return err; > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - int pmd_flags2; > + int pmd_flags2; > > - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > - pmd_flags2 = __PM_SOFT_DIRTY; > - else > - pmd_flags2 = 0; > + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > + pmd_flags2 = __PM_SOFT_DIRTY; > + else > + pmd_flags2 = 0; > > - for (; addr != end; addr += PAGE_SIZE) { > - unsigned long offset; > + for (; addr != end; addr += PAGE_SIZE) { > + unsigned long offset; > > - offset = (addr & ~PAGEMAP_WALK_MASK) >> > - PAGE_SHIFT; > - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > - err = add_to_pagemap(addr, &pme, pm); > - if (err) > - break; > - } > - spin_unlock(ptl); > - /* don't call pagemap_pte() */ > - walk->skip = 1; > + offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; > + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > + err = add_to_pagemap(addr, &pme, pm); > + if (err) > + break; > } > return err; > } > @@ -1277,20 +1260,13 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, > { > struct numa_maps *md = walk->private; > struct vm_area_struct *vma = walk->vma; > - spinlock_t *ptl; > - > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - pte_t huge_pte = *(pte_t *)pmd; > - struct page *page; > - > - page = can_gather_numa_stats(huge_pte, vma, addr); > - if (page) > - gather_stats(page, md, pte_dirty(huge_pte), > - HPAGE_PMD_SIZE/PAGE_SIZE); > - spin_unlock(ptl); > - /* don't call gather_pte_stats() */ > - walk->skip = 1; > - } > + pte_t huge_pte = *(pte_t *)pmd; > + struct page *page; > + > + page = can_gather_numa_stats(huge_pte, vma, addr); > + if (page) > + gather_stats(page, md, pte_dirty(huge_pte), > + HPAGE_PMD_SIZE/PAGE_SIZE); > return 0; > } > #ifdef CONFIG_HUGETLB_PAGE > diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c > index 01a66a208769..bb987cb9e043 100644 > --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c > +++ mmotm-2014-05-21-16-57/mm/memcontrol.c > @@ -6723,15 +6723,9 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, > struct mm_walk *walk) > { > struct vm_area_struct *vma = walk->vma; > - spinlock_t *ptl; > - > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > - mc.precharge += HPAGE_PMD_NR; > - spin_unlock(ptl); > - /* don't call mem_cgroup_count_precharge_pte() */ > - walk->skip = 1; > - } > + > + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > + mc.precharge += HPAGE_PMD_NR; > return 0; > } > > @@ -6952,38 +6946,21 @@ static int mem_cgroup_move_charge_pmd(pmd_t *pmd, > struct page *page; > struct page_cgroup *pc; > > - /* > - * We don't take compound_lock() here but no race with splitting thp > - * happens because: > - * - if pmd_trans_huge_lock() returns 1, the relevant thp is not > - * under splitting, which means there's no concurrent thp split, > - * - if another thread runs into split_huge_page() just after we > - * entered this if-block, the thread must wait for page table lock > - * to be unlocked in __split_huge_page_splitting(), where the main > - * part of thp split is not executed yet. > - */ > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - if (mc.precharge < HPAGE_PMD_NR) { > - spin_unlock(ptl); > - return 0; > - } > - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > - if (target_type == MC_TARGET_PAGE) { > - page = target.page; > - if (!isolate_lru_page(page)) { > - pc = lookup_page_cgroup(page); > - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > - pc, mc.from, mc.to)) { > - mc.precharge -= HPAGE_PMD_NR; > - mc.moved_charge += HPAGE_PMD_NR; > - } > - putback_lru_page(page); > + if (mc.precharge < HPAGE_PMD_NR) > + return 0; > + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > + if (target_type == MC_TARGET_PAGE) { > + page = target.page; > + if (!isolate_lru_page(page)) { > + pc = lookup_page_cgroup(page); > + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > + pc, mc.from, mc.to)) { > + mc.precharge -= HPAGE_PMD_NR; > + mc.moved_charge += HPAGE_PMD_NR; > } > - put_page(page); > + putback_lru_page(page); > } > - spin_unlock(ptl); > - /* don't call mem_cgroup_move_charge_pte() */ > - walk->skip = 1; > + put_page(page); > } > return 0; > } > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > index 24311d6f5c20..f1a3417d0b51 100644 > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, > continue; > } > > - if (walk->pmd_entry) { > - err = walk->pmd_entry(pmd, addr, next, walk); > + /* > + * We don't take compound_lock() here but no race with splitting > + * thp happens because: > + * - if pmd_trans_huge_lock() returns 1, the relevant thp is > + * not under splitting, which means there's no concurrent > + * thp split, > + * - if another thread runs into split_huge_page() just after > + * we entered this if-block, the thread must wait for page > + * table lock to be unlocked in __split_huge_page_splitting(), > + * where the main part of thp split is not executed yet. > + */ > + if (walk->pmd_entry && walk->vma) { > + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { > + err = walk->pmd_entry(pmd, addr, next, walk); > + spin_unlock(walk->ptl); > + } > if (skip_lower_level_walking(walk)) > continue; > if (err) > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id 254446B0031 for ; Tue, 17 Jun 2014 11:02:23 -0400 (EDT) Received: by mail-wi0-f171.google.com with SMTP id n15so6023146wiw.16 for ; Tue, 17 Jun 2014 08:02:22 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id pu1si24630534wjc.33.2014.06.17.08.02.20 for ; Tue, 17 Jun 2014 08:02:21 -0700 (PDT) Date: Tue, 17 Jun 2014 11:01:59 -0400 From: Naoya Horiguchi Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code Message-ID: <20140617150159.GA8524@nhori.redhat.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> <53A0506C.6040609@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53A0506C.6040609@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Marchand Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On Tue, Jun 17, 2014 at 04:27:56PM +0200, Jerome Marchand wrote: > On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: > > Now all of current users of page table walker are canonicalized, i.e. > > pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. > > So we can factorize common code more. > > This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. > > > > ChangeLog v2: > > - add null check walk->vma in walk_pmd_range() > > An older version of this patch already made it to linux-next (commit > b0e08c5) and I've actually hit the NULL pointer dereference. > > Moreover, that patch (or maybe another recent pagewalk patch) breaks > /proc//smaps. All fields that should have been filled by > smaps_pte() are almost always zero (and when it isn't, it's always a > multiple of 2MB). It seems to me that the page walk never goes below > pmd level. Agreed, I'm now thinking that forcing pte_entry() for every user is not good idea, so I'll return to the start point and just will do only the necessary changes (i.e. only iron out the vma handling problem for hugepage.) Thanks, Naoya Horiguchi > Jerome > > > - move comment update into a separate patch > > > > Signed-off-by: Naoya Horiguchi > > --- > > arch/powerpc/mm/subpage-prot.c | 2 ++ > > fs/proc/task_mmu.c | 66 ++++++++++++++---------------------------- > > mm/memcontrol.c | 55 ++++++++++------------------------- > > mm/pagewalk.c | 18 ++++++++++-- > > 4 files changed, 55 insertions(+), 86 deletions(-) > > > > diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > > index fa9fb5b4c66c..d0d94ac606f3 100644 > > --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c > > +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > > @@ -135,7 +135,9 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, > > unsigned long end, struct mm_walk *walk) > > { > > struct vm_area_struct *vma = walk->vma; > > + spin_unlock(walk->ptl); > > split_huge_page_pmd(vma, addr, pmd); > > + spin_lock(walk->ptl); > > return 0; > > } > > > > diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > > index fa6d6a4e85b3..059206ea3c6b 100644 > > --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c > > +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > > @@ -496,15 +496,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > > struct mm_walk *walk) > > { > > struct mem_size_stats *mss = walk->private; > > - spinlock_t *ptl; > > - > > - if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) { > > - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > > - spin_unlock(ptl); > > - mss->anonymous_thp += HPAGE_PMD_SIZE; > > - /* don't call smaps_pte() */ > > - walk->skip = 1; > > - } > > + smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > > + mss->anonymous_thp += HPAGE_PMD_SIZE; > > return 0; > > } > > > > @@ -983,31 +976,21 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > > struct vm_area_struct *vma = walk->vma; > > struct pagemapread *pm = walk->private; > > pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); > > - spinlock_t *ptl; > > - > > - if (!vma) > > - return err; > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - int pmd_flags2; > > + int pmd_flags2; > > > > - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > > - pmd_flags2 = __PM_SOFT_DIRTY; > > - else > > - pmd_flags2 = 0; > > + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > > + pmd_flags2 = __PM_SOFT_DIRTY; > > + else > > + pmd_flags2 = 0; > > > > - for (; addr != end; addr += PAGE_SIZE) { > > - unsigned long offset; > > + for (; addr != end; addr += PAGE_SIZE) { > > + unsigned long offset; > > > > - offset = (addr & ~PAGEMAP_WALK_MASK) >> > > - PAGE_SHIFT; > > - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > > - err = add_to_pagemap(addr, &pme, pm); > > - if (err) > > - break; > > - } > > - spin_unlock(ptl); > > - /* don't call pagemap_pte() */ > > - walk->skip = 1; > > + offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; > > + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > > + err = add_to_pagemap(addr, &pme, pm); > > + if (err) > > + break; > > } > > return err; > > } > > @@ -1277,20 +1260,13 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, > > { > > struct numa_maps *md = walk->private; > > struct vm_area_struct *vma = walk->vma; > > - spinlock_t *ptl; > > - > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - pte_t huge_pte = *(pte_t *)pmd; > > - struct page *page; > > - > > - page = can_gather_numa_stats(huge_pte, vma, addr); > > - if (page) > > - gather_stats(page, md, pte_dirty(huge_pte), > > - HPAGE_PMD_SIZE/PAGE_SIZE); > > - spin_unlock(ptl); > > - /* don't call gather_pte_stats() */ > > - walk->skip = 1; > > - } > > + pte_t huge_pte = *(pte_t *)pmd; > > + struct page *page; > > + > > + page = can_gather_numa_stats(huge_pte, vma, addr); > > + if (page) > > + gather_stats(page, md, pte_dirty(huge_pte), > > + HPAGE_PMD_SIZE/PAGE_SIZE); > > return 0; > > } > > #ifdef CONFIG_HUGETLB_PAGE > > diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c > > index 01a66a208769..bb987cb9e043 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c > > +++ mmotm-2014-05-21-16-57/mm/memcontrol.c > > @@ -6723,15 +6723,9 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, > > struct mm_walk *walk) > > { > > struct vm_area_struct *vma = walk->vma; > > - spinlock_t *ptl; > > - > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > > - mc.precharge += HPAGE_PMD_NR; > > - spin_unlock(ptl); > > - /* don't call mem_cgroup_count_precharge_pte() */ > > - walk->skip = 1; > > - } > > + > > + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > > + mc.precharge += HPAGE_PMD_NR; > > return 0; > > } > > > > @@ -6952,38 +6946,21 @@ static int mem_cgroup_move_charge_pmd(pmd_t *pmd, > > struct page *page; > > struct page_cgroup *pc; > > > > - /* > > - * We don't take compound_lock() here but no race with splitting thp > > - * happens because: > > - * - if pmd_trans_huge_lock() returns 1, the relevant thp is not > > - * under splitting, which means there's no concurrent thp split, > > - * - if another thread runs into split_huge_page() just after we > > - * entered this if-block, the thread must wait for page table lock > > - * to be unlocked in __split_huge_page_splitting(), where the main > > - * part of thp split is not executed yet. > > - */ > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - if (mc.precharge < HPAGE_PMD_NR) { > > - spin_unlock(ptl); > > - return 0; > > - } > > - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > > - if (target_type == MC_TARGET_PAGE) { > > - page = target.page; > > - if (!isolate_lru_page(page)) { > > - pc = lookup_page_cgroup(page); > > - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > > - pc, mc.from, mc.to)) { > > - mc.precharge -= HPAGE_PMD_NR; > > - mc.moved_charge += HPAGE_PMD_NR; > > - } > > - putback_lru_page(page); > > + if (mc.precharge < HPAGE_PMD_NR) > > + return 0; > > + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > > + if (target_type == MC_TARGET_PAGE) { > > + page = target.page; > > + if (!isolate_lru_page(page)) { > > + pc = lookup_page_cgroup(page); > > + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > > + pc, mc.from, mc.to)) { > > + mc.precharge -= HPAGE_PMD_NR; > > + mc.moved_charge += HPAGE_PMD_NR; > > } > > - put_page(page); > > + putback_lru_page(page); > > } > > - spin_unlock(ptl); > > - /* don't call mem_cgroup_move_charge_pte() */ > > - walk->skip = 1; > > + put_page(page); > > } > > return 0; > > } > > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > > index 24311d6f5c20..f1a3417d0b51 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > > @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, > > continue; > > } > > > > - if (walk->pmd_entry) { > > - err = walk->pmd_entry(pmd, addr, next, walk); > > + /* > > + * We don't take compound_lock() here but no race with splitting > > + * thp happens because: > > + * - if pmd_trans_huge_lock() returns 1, the relevant thp is > > + * not under splitting, which means there's no concurrent > > + * thp split, > > + * - if another thread runs into split_huge_page() just after > > + * we entered this if-block, the thread must wait for page > > + * table lock to be unlocked in __split_huge_page_splitting(), > > + * where the main part of thp split is not executed yet. > > + */ > > + if (walk->pmd_entry && walk->vma) { > > + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { > > + err = walk->pmd_entry(pmd, addr, next, walk); > > + spin_unlock(walk->ptl); > > + } > > if (skip_lower_level_walking(walk)) > > continue; > > if (err) > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 203A16B0031 for ; Wed, 18 Jun 2014 11:14:08 -0400 (EDT) Received: by mail-wi0-f181.google.com with SMTP id n3so1337697wiv.8 for ; Wed, 18 Jun 2014 08:14:07 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id r3si17067590wia.1.2014.06.18.08.14.05 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 18 Jun 2014 08:14:06 -0700 (PDT) Message-ID: <53A1ACB1.3050102@redhat.com> Date: Wed, 18 Jun 2014 17:13:53 +0200 From: Jerome Marchand MIME-Version: 1.0 Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> <53A0506C.6040609@redhat.com> <20140617150159.GA8524@nhori.redhat.com> In-Reply-To: <20140617150159.GA8524@nhori.redhat.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Naoya Horiguchi Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On 06/17/2014 05:01 PM, Naoya Horiguchi wrote: > On Tue, Jun 17, 2014 at 04:27:56PM +0200, Jerome Marchand wrote: >> On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: >>> Now all of current users of page table walker are canonicalized, i.e. >>> pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. >>> So we can factorize common code more. >>> This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. >>> >>> ChangeLog v2: >>> - add null check walk->vma in walk_pmd_range() >> >> An older version of this patch already made it to linux-next (commit >> b0e08c5) and I've actually hit the NULL pointer dereference. >> >> Moreover, that patch (or maybe another recent pagewalk patch) breaks >> /proc//smaps. All fields that should have been filled by >> smaps_pte() are almost always zero (and when it isn't, it's always a >> multiple of 2MB). It seems to me that the page walk never goes below >> pmd level. > > Agreed, I'm now thinking that forcing pte_entry() for every user is not > good idea, so I'll return to the start point and just will do only the > necessary changes (i.e. only iron out the vma handling problem for hugepage.) > > Thanks, > Naoya Horiguchi > >> Jerome >> >>> - move comment update into a separate patch >>> >>> Signed-off-by: Naoya Horiguchi >>> --- >>> diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c >>> index 24311d6f5c20..f1a3417d0b51 100644 >>> --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c >>> +++ mmotm-2014-05-21-16-57/mm/pagewalk.c >>> @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, >>> continue; >>> } >>> >>> - if (walk->pmd_entry) { >>> - err = walk->pmd_entry(pmd, addr, next, walk); >>> + /* >>> + * We don't take compound_lock() here but no race with splitting >>> + * thp happens because: >>> + * - if pmd_trans_huge_lock() returns 1, the relevant thp is >>> + * not under splitting, which means there's no concurrent >>> + * thp split, >>> + * - if another thread runs into split_huge_page() just after >>> + * we entered this if-block, the thread must wait for page >>> + * table lock to be unlocked in __split_huge_page_splitting(), >>> + * where the main part of thp split is not executed yet. >>> + */ >>> + if (walk->pmd_entry && walk->vma) { >>> + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { >>> + err = walk->pmd_entry(pmd, addr, next, walk); >>> + spin_unlock(walk->ptl); >>> + } >>> if (skip_lower_level_walking(walk)) >>> continue; >>> if (err) This is the cause of the smaps trouble. This code modifies walk->control when pmd_entry() is present, even when it is not called. All the control code should depend on pmd_trans_huge_lock() == 1 too. Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f44.google.com (mail-wg0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id 701A66B0038 for ; Wed, 18 Jun 2014 11:31:55 -0400 (EDT) Received: by mail-wg0-f44.google.com with SMTP id x13so1018610wgg.3 for ; Wed, 18 Jun 2014 08:31:54 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id by5si3380073wjc.114.2014.06.18.08.31.52 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 18 Jun 2014 08:31:53 -0700 (PDT) Date: Wed, 18 Jun 2014 11:31:45 -0400 From: Naoya Horiguchi Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code Message-ID: <20140618153145.GA26866@nhori> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> <53A0506C.6040609@redhat.com> <20140617150159.GA8524@nhori.redhat.com> <53A1ACB1.3050102@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53A1ACB1.3050102@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Marchand Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org On Wed, Jun 18, 2014 at 05:13:53PM +0200, Jerome Marchand wrote: > On 06/17/2014 05:01 PM, Naoya Horiguchi wrote: > > On Tue, Jun 17, 2014 at 04:27:56PM +0200, Jerome Marchand wrote: > >> On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: > >>> Now all of current users of page table walker are canonicalized, i.e. > >>> pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. > >>> So we can factorize common code more. > >>> This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. > >>> > >>> ChangeLog v2: > >>> - add null check walk->vma in walk_pmd_range() > >> > >> An older version of this patch already made it to linux-next (commit > >> b0e08c5) and I've actually hit the NULL pointer dereference. > >> > >> Moreover, that patch (or maybe another recent pagewalk patch) breaks > >> /proc//smaps. All fields that should have been filled by > >> smaps_pte() are almost always zero (and when it isn't, it's always a > >> multiple of 2MB). It seems to me that the page walk never goes below > >> pmd level. > > > > Agreed, I'm now thinking that forcing pte_entry() for every user is not > > good idea, so I'll return to the start point and just will do only the > > necessary changes (i.e. only iron out the vma handling problem for hugepage.) > > > > Thanks, > > Naoya Horiguchi > > > >> Jerome > >> > >>> - move comment update into a separate patch > >>> > >>> Signed-off-by: Naoya Horiguchi > >>> --- > > > >>> diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > >>> index 24311d6f5c20..f1a3417d0b51 100644 > >>> --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > >>> +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > >>> @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, > >>> continue; > >>> } > >>> > >>> - if (walk->pmd_entry) { > >>> - err = walk->pmd_entry(pmd, addr, next, walk); > >>> + /* > >>> + * We don't take compound_lock() here but no race with splitting > >>> + * thp happens because: > >>> + * - if pmd_trans_huge_lock() returns 1, the relevant thp is > >>> + * not under splitting, which means there's no concurrent > >>> + * thp split, > >>> + * - if another thread runs into split_huge_page() just after > >>> + * we entered this if-block, the thread must wait for page > >>> + * table lock to be unlocked in __split_huge_page_splitting(), > >>> + * where the main part of thp split is not executed yet. > >>> + */ > >>> + if (walk->pmd_entry && walk->vma) { > >>> + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { > >>> + err = walk->pmd_entry(pmd, addr, next, walk); > >>> + spin_unlock(walk->ptl); > >>> + } > >>> if (skip_lower_level_walking(walk)) > >>> continue; > >>> if (err) > > This is the cause of the smaps trouble. This code modifies walk->control > when pmd_entry() is present, even when it is not called. All the control > code should depend on pmd_trans_huge_lock() == 1 too. Thank you for pointing out, I have a few objection about doing aggressive cleanup around this code, and now I'm preparing the next version which does minimum cleanup without walk->control stuff, so I hope your concern will be gone in it. Thanks, Naoya Horiguchi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751027AbaFLVsf (ORCPT ); Thu, 12 Jun 2014 17:48:35 -0400 Received: from mx1.redhat.com ([209.132.183.28]:1563 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750764AbaFLVse (ORCPT ); Thu, 12 Jun 2014 17:48:34 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 00/11] pagewalk: standardize current users, move pmd locking, apply to mincore Date: Thu, 12 Jun 2014 17:48:00 -0400 Message-Id: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is ver.2 of page table walker patchset. I move forward on this cleanup work, and added some improvement from the previous version. Major changes are: - removed walk->skip which becomes removable due to refactoring existing users - commonalized the argments of entry handlers (pte|pmd|hugetlb)_entry() which allows us to use the same function as multiple handlers. This patchset is based on mmotm-2014-05-21-16-57. Tree: git@github.com:Naoya-Horiguchi/linux.git Branch: mmotm-2014-05-21-16-57/page_table_walker.v2 Thanks, Naoya Horiguchi --- Summary: Naoya Horiguchi (11): pagewalk: remove pgd_entry() and pud_entry() madvise: cleanup swapin_walk_pmd_entry() memcg: separate mem_cgroup_move_charge_pte_range() pagewalk: move pmd_trans_huge_lock() from callbacks to common code pagewalk: remove mm_walk->skip pagewalk: add size to struct mm_walk pagewalk: change type of arg of callbacks pagewalk: update comment on walk_page_range() fs/proc/task_mmu.c: refactor smaps fs/proc/task_mmu.c: clean up gather_*_stats() mincore: apply page table walker on do_mincore() arch/openrisc/kernel/dma.c | 6 +- arch/powerpc/mm/subpage-prot.c | 5 +- fs/proc/task_mmu.c | 140 ++++++++--------------------- include/linux/mm.h | 21 ++--- mm/huge_memory.c | 20 ----- mm/madvise.c | 55 +++++------- mm/memcontrol.c | 170 +++++++++++++++++------------------ mm/memory.c | 5 +- mm/mempolicy.c | 15 ++-- mm/mincore.c | 195 ++++++++++++++--------------------------- mm/pagewalk.c | 143 +++++++++++++----------------- 11 files changed, 294 insertions(+), 481 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751468AbaFLVsl (ORCPT ); Thu, 12 Jun 2014 17:48:41 -0400 Received: from mx1.redhat.com ([209.132.183.28]:45129 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751082AbaFLVsj (ORCPT ); Thu, 12 Jun 2014 17:48:39 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 05/11] pagewalk: remove mm_walk->skip Date: Thu, 12 Jun 2014 17:48:05 -0400 Message-Id: <1402609691-13950-6-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Due to the relocation of pmd locking, mm_walk->skip becomes less important because only walk_page_test() and walk->test_walk() use it. None of these functions uses a positive value as a return value, so we can define it to determine whether we skip the current vma or not. Thus this patch removes mm_walk->skip. Signed-off-by: Naoya Horiguchi --- fs/proc/task_mmu.c | 4 ++-- include/linux/mm.h | 3 --- mm/mempolicy.c | 9 ++++----- mm/pagewalk.c | 36 ++++++++---------------------------- 4 files changed, 14 insertions(+), 38 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index 059206ea3c6b..8211f6c8236d 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -755,9 +755,9 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end, * Writing 4 to /proc/pid/clear_refs affects all pages. */ if (cp->type == CLEAR_REFS_ANON && vma->vm_file) - walk->skip = 1; + return 1; if (cp->type == CLEAR_REFS_MAPPED && !vma->vm_file) - walk->skip = 1; + return 1; if (cp->type == CLEAR_REFS_SOFT_DIRTY) { if (vma->vm_flags & VM_SOFTDIRTY) vma->vm_flags &= ~VM_SOFTDIRTY; diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index aa832161a1ff..0a20674c84e2 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1106,8 +1106,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * right now." 0 means "skip the current vma." * @mm: mm_struct representing the target process of page table walk * @vma: vma currently walked - * @skip: internal control flag which is set when we skip the lower - * level entries. * @pmd: current pmd entry * @ptl: page table lock associated with current entry * @private: private data for callbacks' use @@ -1127,7 +1125,6 @@ struct mm_walk { struct mm_walk *walk); struct mm_struct *mm; struct vm_area_struct *vma; - int skip; pmd_t *pmd; spinlock_t *ptl; void *private; diff --git mmotm-2014-05-21-16-57.orig/mm/mempolicy.c mmotm-2014-05-21-16-57/mm/mempolicy.c index cf3b995b21d0..b8267f753748 100644 --- mmotm-2014-05-21-16-57.orig/mm/mempolicy.c +++ mmotm-2014-05-21-16-57/mm/mempolicy.c @@ -596,22 +596,21 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, } qp->prev = vma; - walk->skip = 1; if (vma->vm_flags & VM_PFNMAP) - return 0; + return 1; if (flags & MPOL_MF_LAZY) { change_prot_numa(vma, start, endvma); - return 0; + return 1; } if ((flags & MPOL_MF_STRICT) || ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && vma_migratable(vma))) /* queue pages from current vma */ - walk->skip = 0; - return 0; + return 0; + return 1; } /* diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index f1a3417d0b51..61d6bd9545d6 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -3,24 +3,6 @@ #include #include -/* - * Check the current skip status of page table walker. - * - * Here what I mean by skip is to skip lower level walking, and that was - * determined for each entry independently. For example, when walk_pmd_range - * handles a pmd_trans_huge we don't have to walk over ptes under that pmd, - * and the skipping does not affect the walking over ptes under other pmds. - * That's why we reset @walk->skip after tested. - */ -static bool skip_lower_level_walking(struct mm_walk *walk) -{ - if (walk->skip) { - walk->skip = 0; - return true; - } - return false; -} - static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -89,8 +71,6 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, err = walk->pmd_entry(pmd, addr, next, walk); spin_unlock(walk->ptl); } - if (skip_lower_level_walking(walk)) - continue; if (err) break; } @@ -225,9 +205,9 @@ static inline int walk_hugetlb_range(unsigned long addr, unsigned long end, /* * Decide whether we really walk over the current vma on [@start, @end) - * or skip it. When we skip it, we set @walk->skip to 1. - * The return value is used to control the page table walking to - * continue (for zero) or not (for non-zero). + * or skip it via the returned value. Return 0 if we do walk over the + * current vma, and return 1 if we skip the vma. Negative values means + * error, where we abort the current walk. * * Default check (only VM_PFNMAP check for now) is used when the caller * doesn't define test_walk() callback. @@ -245,7 +225,7 @@ static int walk_page_test(unsigned long start, unsigned long end, * page backing a VM_PFNMAP range. See also commit a9ff785e4437. */ if (vma->vm_flags & VM_PFNMAP) - walk->skip = 1; + return 1; return 0; } @@ -330,9 +310,9 @@ int walk_page_range(unsigned long start, unsigned long end, next = min(end, vma->vm_end); err = walk_page_test(start, next, walk); - if (skip_lower_level_walking(walk)) + if (err == 1) continue; - if (err) + if (err < 0) break; } err = __walk_page_range(start, next, walk); @@ -353,9 +333,9 @@ int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk) VM_BUG_ON(!vma); walk->vma = vma; err = walk_page_test(vma->vm_start, vma->vm_end, walk); - if (skip_lower_level_walking(walk)) + if (err == 1) return 0; - if (err) + if (err < 0) return err; return __walk_page_range(vma->vm_start, vma->vm_end, walk); } -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751960AbaFLVsv (ORCPT ); Thu, 12 Jun 2014 17:48:51 -0400 Received: from mx1.redhat.com ([209.132.183.28]:64524 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751655AbaFLVst (ORCPT ); Thu, 12 Jun 2014 17:48:49 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 11/11] mincore: apply page table walker on do_mincore() Date: Thu, 12 Jun 2014 17:48:11 -0400 Message-Id: <1402609691-13950-12-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch makes do_mincore() use walk_page_vma(), which reduces many lines of code by using common page table walk code. ChangeLog v2: - change type of args of callbacks to void * - move definition of mincore_walk to the start of the function to fix compiler warning Signed-off-by: Naoya Horiguchi --- mm/huge_memory.c | 20 ------ mm/mincore.c | 195 +++++++++++++++++++------------------------------------ 2 files changed, 67 insertions(+), 148 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/huge_memory.c mmotm-2014-05-21-16-57/mm/huge_memory.c index 6fd0668d4e1d..2671a9621d0e 100644 --- mmotm-2014-05-21-16-57.orig/mm/huge_memory.c +++ mmotm-2014-05-21-16-57/mm/huge_memory.c @@ -1379,26 +1379,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } -int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - spinlock_t *ptl; - int ret = 0; - - if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - /* - * All logical pages in the range are present - * if backed by a huge page. - */ - spin_unlock(ptl); - memset(vec, 1, (end - addr) >> PAGE_SHIFT); - ret = 1; - } - - return ret; -} - int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, unsigned long old_addr, unsigned long new_addr, unsigned long old_end, diff --git mmotm-2014-05-21-16-57.orig/mm/mincore.c mmotm-2014-05-21-16-57/mm/mincore.c index 725c80961048..d53b07548ce1 100644 --- mmotm-2014-05-21-16-57.orig/mm/mincore.c +++ mmotm-2014-05-21-16-57/mm/mincore.c @@ -19,38 +19,28 @@ #include #include -static void mincore_hugetlb_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_hugetlb(void *entry, unsigned long addr, + unsigned long end, struct mm_walk *walk) { + int err = 0; #ifdef CONFIG_HUGETLB_PAGE - struct hstate *h; + pte_t *pte = entry; + unsigned char present; + unsigned char *vec = walk->private; - h = hstate_vma(vma); - while (1) { - unsigned char present; - pte_t *ptep; - /* - * Huge pages are always in RAM for now, but - * theoretically it needs to be checked. - */ - ptep = huge_pte_offset(current->mm, - addr & huge_page_mask(h)); - present = ptep && !huge_pte_none(huge_ptep_get(ptep)); - while (1) { - *vec = present; - vec++; - addr += PAGE_SIZE; - if (addr == end) - return; - /* check hugepage border */ - if (!(addr & ~huge_page_mask(h))) - break; - } - } + /* + * Hugepages under user process are always in RAM and never + * swapped out, but theoretically it needs to be checked. + */ + present = pte && !huge_pte_none(huge_ptep_get(pte)); + for (; addr != end; vec++, addr += PAGE_SIZE) + *vec = present; + cond_resched(); + walk->private += (end - addr) >> PAGE_SHIFT; #else BUG(); #endif + return err; } /* @@ -94,10 +84,11 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) return present; } -static void mincore_unmapped_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_hole(unsigned long addr, unsigned long end, + struct mm_walk *walk) { + struct vm_area_struct *vma = walk->vma; + unsigned char *vec = walk->private; unsigned long nr = (end - addr) >> PAGE_SHIFT; int i; @@ -111,110 +102,50 @@ static void mincore_unmapped_range(struct vm_area_struct *vma, for (i = 0; i < nr; i++) vec[i] = 0; } + walk->private += nr; + return 0; } -static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_pte(void *entry, unsigned long addr, unsigned long end, + struct mm_walk *walk) { - unsigned long next; - spinlock_t *ptl; - pte_t *ptep; - - ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - do { - pte_t pte = *ptep; - pgoff_t pgoff; - - next = addr + PAGE_SIZE; - if (pte_none(pte)) - mincore_unmapped_range(vma, addr, next, vec); - else if (pte_present(pte)) + pte_t *pte = entry; + struct vm_area_struct *vma = walk->vma; + unsigned char *vec = walk->private; + pgoff_t pgoff; + + if (pte_present(*pte)) + *vec = 1; + else if (pte_file(*pte)) { + pgoff = pte_to_pgoff(*pte); + *vec = mincore_page(vma->vm_file->f_mapping, pgoff); + } else { /* pte is a swap entry */ + swp_entry_t entry = pte_to_swp_entry(*pte); + + if (is_migration_entry(entry)) { + /* migration entries are always uptodate */ *vec = 1; - else if (pte_file(pte)) { - pgoff = pte_to_pgoff(pte); - *vec = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { /* pte is a swap entry */ - swp_entry_t entry = pte_to_swp_entry(pte); - - if (is_migration_entry(entry)) { - /* migration entries are always uptodate */ - *vec = 1; - } else { + } else { #ifdef CONFIG_SWAP - pgoff = entry.val; - *vec = mincore_page(swap_address_space(entry), - pgoff); + pgoff = entry.val; + *vec = mincore_page(swap_address_space(entry), + pgoff); #else - WARN_ON(1); - *vec = 1; + WARN_ON(1); + *vec = 1; #endif - } - } - vec++; - } while (ptep++, addr = next, addr != end); - pte_unmap_unlock(ptep - 1, ptl); -} - -static void mincore_pmd_range(struct vm_area_struct *vma, pud_t *pud, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pmd_t *pmd; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) { - if (mincore_huge_pmd(vma, pmd, addr, next, vec)) { - vec += (next - addr) >> PAGE_SHIFT; - continue; - } - /* fall through */ } - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pte_range(vma, pmd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pmd++, addr = next, addr != end); -} - -static void mincore_pud_range(struct vm_area_struct *vma, pgd_t *pgd, - unsigned long addr, unsigned long end, - unsigned char *vec) -{ - unsigned long next; - pud_t *pud; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pmd_range(vma, pud, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pud++, addr = next, addr != end); + } + walk->private++; + return 0; } -static void mincore_page_range(struct vm_area_struct *vma, - unsigned long addr, unsigned long end, - unsigned char *vec) +static int mincore_pmd(void *entry, unsigned long addr, unsigned long end, + struct mm_walk *walk) { - unsigned long next; - pgd_t *pgd; - - pgd = pgd_offset(vma->vm_mm, addr); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - mincore_unmapped_range(vma, addr, next, vec); - else - mincore_pud_range(vma, pgd, addr, next, vec); - vec += (next - addr) >> PAGE_SHIFT; - } while (pgd++, addr = next, addr != end); + memset(walk->private, 1, (end - addr) >> PAGE_SHIFT); + walk->private += (end - addr) >> PAGE_SHIFT; + return 0; } /* @@ -226,19 +157,27 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v { struct vm_area_struct *vma; unsigned long end; + int err; + struct mm_walk mincore_walk = { + .pmd_entry = mincore_pmd, + .pte_entry = mincore_pte, + .pte_hole = mincore_hole, + .hugetlb_entry = mincore_hugetlb, + .private = vec, + }; vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - + mincore_walk.mm = vma->vm_mm; + mincore_walk.vma = vma; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - if (is_vm_hugetlb_page(vma)) - mincore_hugetlb_page_range(vma, addr, end, vec); + err = walk_page_vma(vma, &mincore_walk); + if (err < 0) + return err; else - mincore_page_range(vma, addr, end, vec); - - return (end - addr) >> PAGE_SHIFT; + return (end - addr) >> PAGE_SHIFT; } /* -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751867AbaFLVss (ORCPT ); Thu, 12 Jun 2014 17:48:48 -0400 Received: from mx1.redhat.com ([209.132.183.28]:30398 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751631AbaFLVsp (ORCPT ); Thu, 12 Jun 2014 17:48:45 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 10/11] fs/proc/task_mmu.c: clean up gather_*_stats() Date: Thu, 12 Jun 2014 17:48:10 -0400 Message-Id: <1402609691-13950-11-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Most code of gather_(pte|pmd|hugetlb)_stats() are duplicate, so let's clean them up with a single function. vm_normal_page() doesn't calculate pgoff correctly for hugetlbfs, so this patch also fixes it. Signed-off-by: Naoya Horiguchi --- fs/proc/task_mmu.c | 58 ++++++------------------------------------------------ mm/memory.c | 5 ++--- 2 files changed, 8 insertions(+), 55 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index 1f2eab58ae14..27ad736c6b30 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -1243,63 +1243,17 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, return page; } -static int gather_pte_stats(void *entry, unsigned long addr, +static int gather_stats_entry(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { pte_t *pte = entry; struct numa_maps *md = walk->private; - struct page *page = can_gather_numa_stats(*pte, walk->vma, addr); - if (!page) - return 0; - gather_stats(page, md, pte_dirty(*pte), 1); - return 0; -} - -static int gather_pmd_stats(void *entry, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - struct numa_maps *md = walk->private; - struct vm_area_struct *vma = walk->vma; - pte_t huge_pte = *(pte_t *)entry; - struct page *page; - - page = can_gather_numa_stats(huge_pte, vma, addr); if (page) - gather_stats(page, md, pte_dirty(huge_pte), - HPAGE_PMD_SIZE/PAGE_SIZE); + gather_stats(page, md, pte_dirty(*pte), + walk->size >> PAGE_SHIFT); return 0; } -#ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetlb_stats(void *entry, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - pte_t *pte = entry; - struct numa_maps *md; - struct page *page; - - if (pte_none(*pte)) - return 0; - - if (!pte_present(*pte)) - return 0; - - page = pte_page(*pte); - if (!page) - return 0; - - md = walk->private; - gather_stats(page, md, pte_dirty(*pte), 1); - return 0; -} - -#else -static int gather_hugetlb_stats(void *entry, unsigned long addr, - unsigned long end, struct mm_walk *walk) -{ - return 0; -} -#endif /* * Display pages allocated per node and memory policy via /proc. @@ -1324,9 +1278,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) /* Ensure we start with an empty set of numa_maps statistics. */ memset(md, 0, sizeof(*md)); - walk.hugetlb_entry = gather_hugetlb_stats; - walk.pmd_entry = gather_pmd_stats; - walk.pte_entry = gather_pte_stats; + walk.hugetlb_entry = gather_stats_entry; + walk.pmd_entry = gather_stats_entry; + walk.pte_entry = gather_stats_entry; walk.private = md; walk.mm = mm; walk.vma = vma; diff --git mmotm-2014-05-21-16-57.orig/mm/memory.c mmotm-2014-05-21-16-57/mm/memory.c index fd16b767dd68..7389dd04370f 100644 --- mmotm-2014-05-21-16-57.orig/mm/memory.c +++ mmotm-2014-05-21-16-57/mm/memory.c @@ -768,9 +768,8 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, return NULL; goto out; } else { - unsigned long off; - off = (addr - vma->vm_start) >> PAGE_SHIFT; - if (pfn == vma->vm_pgoff + off) + unsigned long off = linear_page_index(vma, addr); + if (pfn == off) return NULL; if (!is_cow_mapping(vma->vm_flags)) return NULL; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751597AbaFLVsn (ORCPT ); Thu, 12 Jun 2014 17:48:43 -0400 Received: from mx1.redhat.com ([209.132.183.28]:1300 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751082AbaFLVsl (ORCPT ); Thu, 12 Jun 2014 17:48:41 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 08/11] pagewalk: update comment on walk_page_range() Date: Thu, 12 Jun 2014 17:48:08 -0400 Message-Id: <1402609691-13950-9-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Rewriting common code of page table walker has been done, so this patch updates the comment on walk_page_range() for the future development. Signed-off-by: Naoya Horiguchi --- mm/pagewalk.c | 55 +++++++++++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 28 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index b46c8882c643..626a80d4d9dd 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -253,39 +253,38 @@ static int __walk_page_range(unsigned long start, unsigned long end, * walk_page_range - walk page table with caller specific callbacks * * Recursively walk the page table tree of the process represented by - * @walk->mm within the virtual address range [@start, @end). In walking, - * we can call caller-specific callback functions against each entry. + * @walk->mm within the virtual address range [@start, @end). During walking, + * we can call caller-specific callback functions against each leaf entry. * * Before starting to walk page table, some callers want to check whether - * they really want to walk over the vma (for example by checking vm_flags.) - * walk_page_test() and @walk->test_walk() do that check. + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @walk->test_walk() are used for this + * purpose. * - * If any callback returns a non-zero value, the page table walk is aborted - * immediately and the return value is propagated back to the caller. - * Note that the meaning of the positive returned value can be defined - * by the caller for its own purpose. + * Currently we have three types of possible leaf enties, pte (for normal + * pages,) pmd (for thps,) and hugetlb. We handle these three with pte_entry(), + * pmd_entry(), and hugetlb_entry(), respectively. + * If you don't set any function to some of these callbacks, the associated + * entries/pages are ignored. + * The return values of these three callbacks are commonly defined like below: + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. + * We can set the same function to different callbacks, where @walk->size + * should be helpful to know the type of entry in callbacks. * - * If the caller defines multiple callbacks in different levels, the - * callbacks are called in depth-first manner. It could happen that - * multiple callbacks are called on a address. For example if some caller - * defines test_walk(), pmd_entry(), and pte_entry(), then callbacks are - * called in the order of test_walk(), pmd_entry(), and pte_entry(). - * If you don't want to go down to lower level at some point and move to - * the next entry in the same level, you set @walk->skip to 1. - * For example if you succeed to handle some pmd entry as trans_huge entry, - * you need not call walk_pte_range() any more, so set it to avoid that. - * We can't determine whether to go down to lower level with the return - * value of the callback, because the whole range of return values (0, >0, - * and <0) are used up for other meanings. + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @walk->private should be helpful. * - * Each callback can access to the vma over which it is doing page table - * walk right now via @walk->vma. @walk->vma is set to NULL in walking - * outside a vma. If you want to access to some caller-specific data from - * callbacks, @walk->private should be helpful. - * - * The callers should hold @walk->mm->mmap_sem. Note that the lower level - * iterators can take page table lock in lowest level iteration and/or - * in split_huge_page_pmd(). + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold + * @walk->mm->mmap_sem, because these function traverse vma list and/or + * access to vma's data. And page table lock is held during running + * pmd_entry(), pte_entry(), and hugetlb_entry(). */ int walk_page_range(unsigned long start, unsigned long end, struct mm_walk *walk) -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752154AbaFLVtl (ORCPT ); Thu, 12 Jun 2014 17:49:41 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47510 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751495AbaFLVsn (ORCPT ); Thu, 12 Jun 2014 17:48:43 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 09/11] fs/proc/task_mmu.c: refactor smaps Date: Thu, 12 Jun 2014 17:48:09 -0400 Message-Id: <1402609691-13950-10-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org smaps_(pte|pmd)() are almost the same, so let's clean them up to a single function. Signed-off-by: Naoya Horiguchi --- fs/proc/task_mmu.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index a750d0842875..1f2eab58ae14 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -437,7 +437,7 @@ struct mem_size_stats { u64 pss; }; -static int smaps_pte(void *entry, unsigned long addr, unsigned long end, +static int smaps_entry(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { pte_t *pte = entry; @@ -490,15 +490,10 @@ static int smaps_pte(void *entry, unsigned long addr, unsigned long end, mss->private_clean += ptent_size; mss->pss += (ptent_size << PSS_SHIFT); } - return 0; -} -static int smaps_pmd(void *entry, unsigned long addr, unsigned long end, - struct mm_walk *walk) -{ - struct mem_size_stats *mss = walk->private; - smaps_pte(entry, addr, end, walk); - mss->anonymous_thp += HPAGE_PMD_SIZE; + if (walk->size == PMD_SIZE) + mss->anonymous_thp += HPAGE_PMD_SIZE; + return 0; } @@ -563,8 +558,8 @@ static int show_smap(struct seq_file *m, void *v, int is_pid) struct vm_area_struct *vma = v; struct mem_size_stats mss; struct mm_walk smaps_walk = { - .pmd_entry = smaps_pmd, - .pte_entry = smaps_pte, + .pmd_entry = smaps_entry, + .pte_entry = smaps_entry, .mm = vma->vm_mm, .vma = vma, .private = &mss, -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751366AbaFLVsj (ORCPT ); Thu, 12 Jun 2014 17:48:39 -0400 Received: from mx1.redhat.com ([209.132.183.28]:58581 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751082AbaFLVsg (ORCPT ); Thu, 12 Jun 2014 17:48:36 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 01/11] pagewalk: remove pgd_entry() and pud_entry() Date: Thu, 12 Jun 2014 17:48:01 -0400 Message-Id: <1402609691-13950-2-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently no user of page table walker sets ->pgd_entry() or ->pud_entry(), so checking their existence in each loop is just wasting CPU cycle. So let's remove it to reduce overhead. Signed-off-by: Naoya Horiguchi --- include/linux/mm.h | 6 ------ mm/pagewalk.c | 18 +----------------- 2 files changed, 1 insertion(+), 23 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index 563c79ea07bd..b4aa6579f2b1 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1092,8 +1092,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, /** * mm_walk - callbacks for walk_page_range - * @pgd_entry: if set, called for each non-empty PGD (top-level) entry - * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to @@ -1115,10 +1113,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * (see the comment on walk_page_range() for more details) */ struct mm_walk { - int (*pgd_entry)(pgd_t *pgd, unsigned long addr, - unsigned long next, struct mm_walk *walk); - int (*pud_entry)(pud_t *pud, unsigned long addr, - unsigned long next, struct mm_walk *walk); int (*pmd_entry)(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_entry)(pte_t *pte, unsigned long addr, diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index 2eda3dbe0b52..e734f63276c2 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -115,14 +115,6 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, continue; } - if (walk->pud_entry) { - err = walk->pud_entry(pud, addr, next, walk); - if (skip_lower_level_walking(walk)) - continue; - if (err) - break; - } - if (walk->pmd_entry || walk->pte_entry) { err = walk_pmd_range(pud, addr, next, walk); if (err) @@ -152,15 +144,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, continue; } - if (walk->pgd_entry) { - err = walk->pgd_entry(pgd, addr, next, walk); - if (skip_lower_level_walking(walk)) - continue; - if (err) - break; - } - - if (walk->pud_entry || walk->pmd_entry || walk->pte_entry) { + if (walk->pmd_entry || walk->pte_entry) { err = walk_pud_range(pgd, addr, next, walk); if (err) break; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752185AbaFLVuQ (ORCPT ); Thu, 12 Jun 2014 17:50:16 -0400 Received: from mx1.redhat.com ([209.132.183.28]:59318 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751382AbaFLVsk (ORCPT ); Thu, 12 Jun 2014 17:48:40 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 07/11] pagewalk: change type of arg of callbacks Date: Thu, 12 Jun 2014 17:48:07 -0400 Message-Id: <1402609691-13950-8-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Page table walker focuses on the leaf entries, and in some situation the caller is interested only in the size of entry (not in the details of pages pointed to by the entry.) Then it's helpful to share callback functions between different levels. For this purpose this patch changes args in callback functions and let them get the pointer of the entry in type of (void *). Signed-off-by: Naoya Horiguchi --- arch/openrisc/kernel/dma.c | 6 ++++-- arch/powerpc/mm/subpage-prot.c | 3 ++- fs/proc/task_mmu.c | 31 +++++++++++++++++++------------ include/linux/mm.h | 6 +++--- mm/madvise.c | 3 ++- mm/memcontrol.c | 12 ++++++++---- mm/mempolicy.c | 6 ++++-- 7 files changed, 42 insertions(+), 25 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/arch/openrisc/kernel/dma.c mmotm-2014-05-21-16-57/arch/openrisc/kernel/dma.c index 0b77ddb1ee07..a2983e8f6f04 100644 --- mmotm-2014-05-21-16-57.orig/arch/openrisc/kernel/dma.c +++ mmotm-2014-05-21-16-57/arch/openrisc/kernel/dma.c @@ -29,9 +29,10 @@ #include static int -page_set_nocache(pte_t *pte, unsigned long addr, +page_set_nocache(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk) { + pte_t *pte = entry; unsigned long cl; pte_val(*pte) |= _PAGE_CI; @@ -50,9 +51,10 @@ page_set_nocache(pte_t *pte, unsigned long addr, } static int -page_clear_nocache(pte_t *pte, unsigned long addr, +page_clear_nocache(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk) { + pte_t *pte = entry; pte_val(*pte) &= ~_PAGE_CI; /* diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c index d0d94ac606f3..d62e9adc93fb 100644 --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c @@ -131,9 +131,10 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len) } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, +static int subpage_walk_pmd_entry(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; spin_unlock(walk->ptl); split_huge_page_pmd(vma, addr, pmd); diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index 8211f6c8236d..a750d0842875 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -437,9 +437,10 @@ struct mem_size_stats { u64 pss; }; -static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, +static int smaps_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct mem_size_stats *mss = walk->private; struct vm_area_struct *vma = walk->vma; pgoff_t pgoff = linear_page_index(vma, addr); @@ -492,11 +493,11 @@ static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, return 0; } -static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, +static int smaps_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); + smaps_pte(entry, addr, end, walk); mss->anonymous_thp += HPAGE_PMD_SIZE; return 0; } @@ -720,9 +721,10 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, #endif } -static int clear_refs_pte(pte_t *pte, unsigned long addr, +static int clear_refs_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct clear_refs_private *cp = walk->private; struct vm_area_struct *vma = walk->vma; struct page *page; @@ -954,9 +956,10 @@ static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemap } #endif -static int pagemap_pte(pte_t *pte, unsigned long addr, unsigned long end, +static int pagemap_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct vm_area_struct *vma = walk->vma; struct pagemapread *pm = walk->private; pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); @@ -969,10 +972,11 @@ static int pagemap_pte(pte_t *pte, unsigned long addr, unsigned long end, return add_to_pagemap(addr, &pme, pm); } -static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, +static int pagemap_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { int err = 0; + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; struct pagemapread *pm = walk->private; pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); @@ -1009,9 +1013,10 @@ static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread * } /* This function walks within one hugetlb entry in the single call */ -static int pagemap_hugetlb(pte_t *pte, unsigned long addr, unsigned long end, +static int pagemap_hugetlb(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct pagemapread *pm = walk->private; struct vm_area_struct *vma = walk->vma; int err = 0; @@ -1243,9 +1248,10 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, return page; } -static int gather_pte_stats(pte_t *pte, unsigned long addr, +static int gather_pte_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct numa_maps *md = walk->private; struct page *page = can_gather_numa_stats(*pte, walk->vma, addr); @@ -1255,12 +1261,12 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr, return 0; } -static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, +static int gather_pmd_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct numa_maps *md = walk->private; struct vm_area_struct *vma = walk->vma; - pte_t huge_pte = *(pte_t *)pmd; + pte_t huge_pte = *(pte_t *)entry; struct page *page; page = can_gather_numa_stats(huge_pte, vma, addr); @@ -1270,9 +1276,10 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, return 0; } #ifdef CONFIG_HUGETLB_PAGE -static int gather_hugetlb_stats(pte_t *pte, unsigned long addr, +static int gather_hugetlb_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; struct numa_maps *md; struct page *page; @@ -1292,7 +1299,7 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long addr, } #else -static int gather_hugetlb_stats(pte_t *pte, unsigned long addr, +static int gather_hugetlb_stats(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { return 0; diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index cbe17d9cbd7f..08c2a128dd5c 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1114,13 +1114,13 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * (see the comment on walk_page_range() for more details) */ struct mm_walk { - int (*pmd_entry)(pmd_t *pmd, unsigned long addr, + int (*pmd_entry)(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk); - int (*pte_entry)(pte_t *pte, unsigned long addr, + int (*pte_entry)(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*pte_hole)(unsigned long addr, unsigned long next, struct mm_walk *walk); - int (*hugetlb_entry)(pte_t *pte, unsigned long addr, + int (*hugetlb_entry)(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk); int (*test_walk)(unsigned long addr, unsigned long next, struct mm_walk *walk); diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c index 06b390a6fbbd..df664fcbd443 100644 --- mmotm-2014-05-21-16-57.orig/mm/madvise.c +++ mmotm-2014-05-21-16-57/mm/madvise.c @@ -138,9 +138,10 @@ static long madvise_behavior(struct vm_area_struct *vma, /* * Assuming that page table walker holds page table lock. */ -static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, +static int swapin_walk_pte_entry(void *ent, unsigned long start, unsigned long end, struct mm_walk *walk) { + pte_t *pte = ent; pte_t ptent; pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); swp_entry_t entry; diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c index bb987cb9e043..7d62b6778a5b 100644 --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c +++ mmotm-2014-05-21-16-57/mm/memcontrol.c @@ -6709,19 +6709,21 @@ static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, } #endif -static int mem_cgroup_count_precharge_pte(pte_t *pte, +static int mem_cgroup_count_precharge_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pte_t *pte = entry; if (get_mctgt_type(walk->vma, addr, *pte, NULL)) mc.precharge++; /* increment precharge temporarily */ return 0; } -static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, +static int mem_cgroup_count_precharge_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) @@ -6875,11 +6877,12 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css, mem_cgroup_clear_mc(); } -static int mem_cgroup_move_charge_pte(pte_t *pte, +static int mem_cgroup_move_charge_pte(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { int ret = 0; + pte_t *pte = entry; struct vm_area_struct *vma = walk->vma; union mc_target target; struct page *page; @@ -6936,10 +6939,11 @@ put: /* get_mctgt_type() gets the page */ return 0; } -static int mem_cgroup_move_charge_pmd(pmd_t *pmd, +static int mem_cgroup_move_charge_pmd(void *entry, unsigned long addr, unsigned long end, struct mm_walk *walk) { + pmd_t *pmd = entry; struct vm_area_struct *vma = walk->vma; enum mc_target_type target_type; union mc_target target; diff --git mmotm-2014-05-21-16-57.orig/mm/mempolicy.c mmotm-2014-05-21-16-57/mm/mempolicy.c index b8267f753748..f74cacb36b95 100644 --- mmotm-2014-05-21-16-57.orig/mm/mempolicy.c +++ mmotm-2014-05-21-16-57/mm/mempolicy.c @@ -490,9 +490,10 @@ struct queue_pages { * Scan through pages checking if pages follow certain conditions, * and move them to the pagelist if they do. */ -static int queue_pages_pte(pte_t *pte, unsigned long addr, +static int queue_pages_pte(void *entry, unsigned long addr, unsigned long next, struct mm_walk *walk) { + pte_t *pte = entry; struct vm_area_struct *vma = walk->vma; struct page *page; struct queue_pages *qp = walk->private; @@ -519,10 +520,11 @@ static int queue_pages_pte(pte_t *pte, unsigned long addr, return 0; } -static int queue_pages_hugetlb(pte_t *pte, unsigned long addr, +static int queue_pages_hugetlb(void *ent, unsigned long addr, unsigned long next, struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE + pte_t *pte = ent; struct queue_pages *qp = walk->private; unsigned long flags = qp->flags; int nid; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752252AbaFLVum (ORCPT ); Thu, 12 Jun 2014 17:50:42 -0400 Received: from mx1.redhat.com ([209.132.183.28]:61986 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751339AbaFLVsj (ORCPT ); Thu, 12 Jun 2014 17:48:39 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 06/11] pagewalk: add size to struct mm_walk Date: Thu, 12 Jun 2014 17:48:06 -0400 Message-Id: <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This variable is helpful if we try to share the callback function between multiple slots (for example between pte_entry() and pmd_entry()) as done in later patches. Signed-off-by: Naoya Horiguchi --- include/linux/mm.h | 2 ++ mm/pagewalk.c | 5 +++++ 2 files changed, 7 insertions(+) diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index 0a20674c84e2..cbe17d9cbd7f 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1108,6 +1108,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @vma: vma currently walked * @pmd: current pmd entry * @ptl: page table lock associated with current entry + * @size: size of current entry * @private: private data for callbacks' use * * (see the comment on walk_page_range() for more details) @@ -1127,6 +1128,7 @@ struct mm_walk { struct vm_area_struct *vma; pmd_t *pmd; spinlock_t *ptl; + long size; void *private; }; diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index 61d6bd9545d6..b46c8882c643 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -11,6 +11,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, pte_t *orig_pte; int err = 0; + walk->size = PAGE_SIZE; walk->pmd = pmd; orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); do { @@ -42,6 +43,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long next; int err = 0; + walk->size = PMD_SIZE; pmd = pmd_offset(pud, addr); do { again: @@ -97,6 +99,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long next; int err = 0; + walk->size = PUD_SIZE; pud = pud_offset(pgd, addr); do { next = pud_addr_end(addr, end); @@ -126,6 +129,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, unsigned long next; int err = 0; + walk->size = PGDIR_SIZE; pgd = pgd_offset(walk->mm, addr); do { next = pgd_addr_end(addr, end); @@ -167,6 +171,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, pte_t *pte; int err = 0; + walk->size = huge_page_size(h); do { next = hugetlb_entry_end(h, addr, end); pte = huge_pte_offset(walk->mm, addr & hmask); -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752397AbaFLVvF (ORCPT ); Thu, 12 Jun 2014 17:51:05 -0400 Received: from mx1.redhat.com ([209.132.183.28]:32634 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751284AbaFLVsi (ORCPT ); Thu, 12 Jun 2014 17:48:38 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code Date: Thu, 12 Jun 2014 17:48:04 -0400 Message-Id: <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Now all of current users of page table walker are canonicalized, i.e. pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. So we can factorize common code more. This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. ChangeLog v2: - add null check walk->vma in walk_pmd_range() - move comment update into a separate patch Signed-off-by: Naoya Horiguchi --- arch/powerpc/mm/subpage-prot.c | 2 ++ fs/proc/task_mmu.c | 66 ++++++++++++++---------------------------- mm/memcontrol.c | 55 ++++++++++------------------------- mm/pagewalk.c | 18 ++++++++++-- 4 files changed, 55 insertions(+), 86 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c index fa9fb5b4c66c..d0d94ac606f3 100644 --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c @@ -135,7 +135,9 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct vm_area_struct *vma = walk->vma; + spin_unlock(walk->ptl); split_huge_page_pmd(vma, addr, pmd); + spin_lock(walk->ptl); return 0; } diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c index fa6d6a4e85b3..059206ea3c6b 100644 --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c @@ -496,15 +496,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct mem_size_stats *mss = walk->private; - spinlock_t *ptl; - - if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) { - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); - spin_unlock(ptl); - mss->anonymous_thp += HPAGE_PMD_SIZE; - /* don't call smaps_pte() */ - walk->skip = 1; - } + smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); + mss->anonymous_thp += HPAGE_PMD_SIZE; return 0; } @@ -983,31 +976,21 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, struct vm_area_struct *vma = walk->vma; struct pagemapread *pm = walk->private; pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); - spinlock_t *ptl; - - if (!vma) - return err; - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - int pmd_flags2; + int pmd_flags2; - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) - pmd_flags2 = __PM_SOFT_DIRTY; - else - pmd_flags2 = 0; + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) + pmd_flags2 = __PM_SOFT_DIRTY; + else + pmd_flags2 = 0; - for (; addr != end; addr += PAGE_SIZE) { - unsigned long offset; + for (; addr != end; addr += PAGE_SIZE) { + unsigned long offset; - offset = (addr & ~PAGEMAP_WALK_MASK) >> - PAGE_SHIFT; - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); - err = add_to_pagemap(addr, &pme, pm); - if (err) - break; - } - spin_unlock(ptl); - /* don't call pagemap_pte() */ - walk->skip = 1; + offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); + err = add_to_pagemap(addr, &pme, pm); + if (err) + break; } return err; } @@ -1277,20 +1260,13 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, { struct numa_maps *md = walk->private; struct vm_area_struct *vma = walk->vma; - spinlock_t *ptl; - - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - pte_t huge_pte = *(pte_t *)pmd; - struct page *page; - - page = can_gather_numa_stats(huge_pte, vma, addr); - if (page) - gather_stats(page, md, pte_dirty(huge_pte), - HPAGE_PMD_SIZE/PAGE_SIZE); - spin_unlock(ptl); - /* don't call gather_pte_stats() */ - walk->skip = 1; - } + pte_t huge_pte = *(pte_t *)pmd; + struct page *page; + + page = can_gather_numa_stats(huge_pte, vma, addr); + if (page) + gather_stats(page, md, pte_dirty(huge_pte), + HPAGE_PMD_SIZE/PAGE_SIZE); return 0; } #ifdef CONFIG_HUGETLB_PAGE diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c index 01a66a208769..bb987cb9e043 100644 --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c +++ mmotm-2014-05-21-16-57/mm/memcontrol.c @@ -6723,15 +6723,9 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, struct mm_walk *walk) { struct vm_area_struct *vma = walk->vma; - spinlock_t *ptl; - - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) - mc.precharge += HPAGE_PMD_NR; - spin_unlock(ptl); - /* don't call mem_cgroup_count_precharge_pte() */ - walk->skip = 1; - } + + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) + mc.precharge += HPAGE_PMD_NR; return 0; } @@ -6952,38 +6946,21 @@ static int mem_cgroup_move_charge_pmd(pmd_t *pmd, struct page *page; struct page_cgroup *pc; - /* - * We don't take compound_lock() here but no race with splitting thp - * happens because: - * - if pmd_trans_huge_lock() returns 1, the relevant thp is not - * under splitting, which means there's no concurrent thp split, - * - if another thread runs into split_huge_page() just after we - * entered this if-block, the thread must wait for page table lock - * to be unlocked in __split_huge_page_splitting(), where the main - * part of thp split is not executed yet. - */ - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { - if (mc.precharge < HPAGE_PMD_NR) { - spin_unlock(ptl); - return 0; - } - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); - if (target_type == MC_TARGET_PAGE) { - page = target.page; - if (!isolate_lru_page(page)) { - pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, - pc, mc.from, mc.to)) { - mc.precharge -= HPAGE_PMD_NR; - mc.moved_charge += HPAGE_PMD_NR; - } - putback_lru_page(page); + if (mc.precharge < HPAGE_PMD_NR) + return 0; + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); + if (target_type == MC_TARGET_PAGE) { + page = target.page; + if (!isolate_lru_page(page)) { + pc = lookup_page_cgroup(page); + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, + pc, mc.from, mc.to)) { + mc.precharge -= HPAGE_PMD_NR; + mc.moved_charge += HPAGE_PMD_NR; } - put_page(page); + putback_lru_page(page); } - spin_unlock(ptl); - /* don't call mem_cgroup_move_charge_pte() */ - walk->skip = 1; + put_page(page); } return 0; } diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index 24311d6f5c20..f1a3417d0b51 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, continue; } - if (walk->pmd_entry) { - err = walk->pmd_entry(pmd, addr, next, walk); + /* + * We don't take compound_lock() here but no race with splitting + * thp happens because: + * - if pmd_trans_huge_lock() returns 1, the relevant thp is + * not under splitting, which means there's no concurrent + * thp split, + * - if another thread runs into split_huge_page() just after + * we entered this if-block, the thread must wait for page + * table lock to be unlocked in __split_huge_page_splitting(), + * where the main part of thp split is not executed yet. + */ + if (walk->pmd_entry && walk->vma) { + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { + err = walk->pmd_entry(pmd, addr, next, walk); + spin_unlock(walk->ptl); + } if (skip_lower_level_walking(walk)) continue; if (err) -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751288AbaFLVsh (ORCPT ); Thu, 12 Jun 2014 17:48:37 -0400 Received: from mx1.redhat.com ([209.132.183.28]:17563 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750764AbaFLVsg (ORCPT ); Thu, 12 Jun 2014 17:48:36 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 03/11] memcg: separate mem_cgroup_move_charge_pte_range() Date: Thu, 12 Jun 2014 17:48:03 -0400 Message-Id: <1402609691-13950-4-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mem_cgroup_move_charge_pte_range() handles both pte and pmd, which is not standardized, so let's cleanup it. One tricky part is the retry, which is performed when we detect !mc.precharge. In such case we retry the same entry, so we don't have to go outside the pte loop. With rewriting this retry in the pte loop, we can separate pmd_entry() and pte_entry(), which is what we need. Signed-off-by: Naoya Horiguchi --- mm/memcontrol.c | 127 +++++++++++++++++++++++++++++--------------------------- 1 file changed, 66 insertions(+), 61 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c index 6970857ba0c8..01a66a208769 100644 --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c +++ mmotm-2014-05-21-16-57/mm/memcontrol.c @@ -6881,14 +6881,72 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css, mem_cgroup_clear_mc(); } -static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, +static int mem_cgroup_move_charge_pte(pte_t *pte, unsigned long addr, unsigned long end, struct mm_walk *walk) { int ret = 0; struct vm_area_struct *vma = walk->vma; - pte_t *pte; - spinlock_t *ptl; + union mc_target target; + struct page *page; + struct page_cgroup *pc; + swp_entry_t ent; + +retry: + if (!mc.precharge) { + pte_t *orig_pte = pte - ((addr & (PMD_SIZE - 1)) >> PAGE_SHIFT); + pte_unmap_unlock(orig_pte, walk->ptl); + cond_resched(); + /* + * We have consumed all precharges we got in can_attach(). + * We try charge one by one, but don't do any additional + * charges to mc.to if we have failed in charge once in attach() + * phase. + */ + ret = mem_cgroup_do_precharge(1); + pte_offset_map(walk->pmd, addr & PMD_MASK); + spin_lock(walk->ptl); + if (!ret) + goto retry; + return ret; + } + + switch (get_mctgt_type(vma, addr, *pte, &target)) { + case MC_TARGET_PAGE: + page = target.page; + if (isolate_lru_page(page)) + goto put; + pc = lookup_page_cgroup(page); + if (!mem_cgroup_move_account(page, 1, pc, + mc.from, mc.to)) { + mc.precharge--; + /* we uncharge from mc.from later. */ + mc.moved_charge++; + } + putback_lru_page(page); +put: /* get_mctgt_type() gets the page */ + put_page(page); + break; + case MC_TARGET_SWAP: + ent = target.ent; + if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) { + mc.precharge--; + /* we fixup refcnts and charges later. */ + mc.moved_swap++; + } + break; + default: + break; + } + + return 0; +} + +static int mem_cgroup_move_charge_pmd(pmd_t *pmd, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; enum mc_target_type target_type; union mc_target target; struct page *page; @@ -6924,71 +6982,18 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, put_page(page); } spin_unlock(ptl); - return 0; - } - - if (pmd_trans_unstable(pmd)) - return 0; -retry: - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); - for (; addr != end; addr += PAGE_SIZE) { - pte_t ptent = *(pte++); - swp_entry_t ent; - - if (!mc.precharge) - break; - - switch (get_mctgt_type(vma, addr, ptent, &target)) { - case MC_TARGET_PAGE: - page = target.page; - if (isolate_lru_page(page)) - goto put; - pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, 1, pc, - mc.from, mc.to)) { - mc.precharge--; - /* we uncharge from mc.from later. */ - mc.moved_charge++; - } - putback_lru_page(page); -put: /* get_mctgt_type() gets the page */ - put_page(page); - break; - case MC_TARGET_SWAP: - ent = target.ent; - if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) { - mc.precharge--; - /* we fixup refcnts and charges later. */ - mc.moved_swap++; - } - break; - default: - break; - } - } - pte_unmap_unlock(pte - 1, ptl); - cond_resched(); - - if (addr != end) { - /* - * We have consumed all precharges we got in can_attach(). - * We try charge one by one, but don't do any additional - * charges to mc.to if we have failed in charge once in attach() - * phase. - */ - ret = mem_cgroup_do_precharge(1); - if (!ret) - goto retry; + /* don't call mem_cgroup_move_charge_pte() */ + walk->skip = 1; } - - return ret; + return 0; } static void mem_cgroup_move_charge(struct mm_struct *mm) { struct vm_area_struct *vma; struct mm_walk mem_cgroup_move_charge_walk = { - .pmd_entry = mem_cgroup_move_charge_pte_range, + .pmd_entry = mem_cgroup_move_charge_pmd, + .pte_entry = mem_cgroup_move_charge_pte, .mm = mm, }; -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752474AbaFLVvi (ORCPT ); Thu, 12 Jun 2014 17:51:38 -0400 Received: from mx1.redhat.com ([209.132.183.28]:15687 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751075AbaFLVsg (ORCPT ); Thu, 12 Jun 2014 17:48:36 -0400 From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: [PATCH -mm v2 02/11] madvise: cleanup swapin_walk_pmd_entry() Date: Thu, 12 Jun 2014 17:48:02 -0400 Message-Id: <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org With the recent update on page table walker, we can use common code for the walking more. Unlike many other users, this swapin_walk expects to handle swap entries. As a result we should be careful about ptl locking. Swapin operation, read_swap_cache_async(), could cause page reclaim, so we can't keep holding ptl throughout this pte loop. In order to properly handle ptl in pte_entry(), this patch adds two new members on struct mm_walk. This cleanup is necessary to get to the final form of page table walker, where we should do all caller's specific work on leaf entries (IOW, all pmd_entry() should be used for trans_pmd.) Signed-off-by: Naoya Horiguchi Cc: Hugh Dickins --- include/linux/mm.h | 4 ++++ mm/madvise.c | 54 +++++++++++++++++++++++------------------------------- mm/pagewalk.c | 11 +++++------ 3 files changed, 32 insertions(+), 37 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h index b4aa6579f2b1..aa832161a1ff 100644 --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h +++ mmotm-2014-05-21-16-57/include/linux/mm.h @@ -1108,6 +1108,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, * @vma: vma currently walked * @skip: internal control flag which is set when we skip the lower * level entries. + * @pmd: current pmd entry + * @ptl: page table lock associated with current entry * @private: private data for callbacks' use * * (see the comment on walk_page_range() for more details) @@ -1126,6 +1128,8 @@ struct mm_walk { struct mm_struct *mm; struct vm_area_struct *vma; int skip; + pmd_t *pmd; + spinlock_t *ptl; void *private; }; diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c index a402f8fdc68e..06b390a6fbbd 100644 --- mmotm-2014-05-21-16-57.orig/mm/madvise.c +++ mmotm-2014-05-21-16-57/mm/madvise.c @@ -135,38 +135,31 @@ static long madvise_behavior(struct vm_area_struct *vma, } #ifdef CONFIG_SWAP -static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, +/* + * Assuming that page table walker holds page table lock. + */ +static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, unsigned long end, struct mm_walk *walk) { - pte_t *orig_pte; - struct vm_area_struct *vma = walk->private; - unsigned long index; - - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) - return 0; - - for (index = start; index != end; index += PAGE_SIZE) { - pte_t pte; - swp_entry_t entry; - struct page *page; - spinlock_t *ptl; - - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); - pte_unmap_unlock(orig_pte, ptl); - - if (pte_present(pte) || pte_none(pte) || pte_file(pte)) - continue; - entry = pte_to_swp_entry(pte); - if (unlikely(non_swap_entry(entry))) - continue; - - page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, - vma, index); - if (page) - page_cache_release(page); - } + pte_t ptent; + pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); + swp_entry_t entry; + struct page *page; + ptent = *pte; + pte_unmap_unlock(orig_pte, walk->ptl); + if (pte_present(ptent) || pte_none(ptent) || pte_file(ptent)) + goto lock; + entry = pte_to_swp_entry(ptent); + if (unlikely(non_swap_entry(entry))) + goto lock; + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, + walk->vma, start); + if (page) + page_cache_release(page); +lock: + pte_offset_map(walk->pmd, start & PMD_MASK); + spin_lock(walk->ptl); return 0; } @@ -175,8 +168,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma, { struct mm_walk walk = { .mm = vma->vm_mm, - .pmd_entry = swapin_walk_pmd_entry, - .private = vma, + .pte_entry = swapin_walk_pte_entry, }; walk_page_range(start, end, &walk); diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c index e734f63276c2..24311d6f5c20 100644 --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c +++ mmotm-2014-05-21-16-57/mm/pagewalk.c @@ -27,10 +27,10 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, struct mm_struct *mm = walk->mm; pte_t *pte; pte_t *orig_pte; - spinlock_t *ptl; int err = 0; - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + walk->pmd = pmd; + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); do { if (pte_none(*pte)) { if (walk->pte_hole) @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, if (err) break; } while (pte++, addr += PAGE_SIZE, addr < end); - pte_unmap_unlock(orig_pte, ptl); + pte_unmap_unlock(orig_pte, walk->ptl); cond_resched(); return addr == end ? 0 : err; } @@ -172,7 +172,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, unsigned long hmask = huge_page_mask(h); pte_t *pte; int err = 0; - spinlock_t *ptl; do { next = hugetlb_entry_end(h, addr, end); @@ -186,14 +185,14 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, break; continue; } - ptl = huge_pte_lock(h, mm, pte); + walk->ptl = huge_pte_lock(h, mm, pte); /* * Callers should have their own way to handle swap entries * in walk->hugetlb_entry(). */ if (walk->hugetlb_entry) err = walk->hugetlb_entry(pte, addr, next, walk); - spin_unlock(ptl); + spin_unlock(walk->ptl); if (err) break; } while (addr = next, addr != end); -- 1.9.3 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750920AbaFLV4T (ORCPT ); Thu, 12 Jun 2014 17:56:19 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:36216 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750773AbaFLV4S (ORCPT ); Thu, 12 Jun 2014 17:56:18 -0400 Date: Thu, 12 Jun 2014 14:56:17 -0700 From: Andrew Morton To: Naoya Horiguchi Cc: linux-mm@kvack.org, Dave Hansen , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 00/11] pagewalk: standardize current users, move pmd locking, apply to mincore Message-Id: <20140612145617.5debf04bd1a2978be4a1fb88@linux-foundation.org> In-Reply-To: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 12 Jun 2014 17:48:00 -0400 Naoya Horiguchi wrote: > This is ver.2 of page table walker patchset. > > I move forward on this cleanup work, and added some improvement from the > previous version. Major changes are: > - removed walk->skip which becomes removable due to refactoring existing > users > - commonalized the argments of entry handlers (pte|pmd|hugetlb)_entry() > which allows us to use the same function as multiple handlers. > Are you sure you didn't miss anything? mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v2.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3-fix.patch pagewalk-update-page-table-walker-core.patch pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range.patch pagewalk-update-page-table-walker-core-fix-end-address-calculation-in-walk_page_range-fix.patch pagewalk-update-page-table-walker-core-fix.patch pagewalk-add-walk_page_vma.patch smaps-redefine-callback-functions-for-page-table-walker.patch clear_refs-redefine-callback-functions-for-page-table-walker.patch pagemap-redefine-callback-functions-for-page-table-walker.patch pagemap-redefine-callback-functions-for-page-table-walker-fix.patch numa_maps-redefine-callback-functions-for-page-table-walker.patch memcg-redefine-callback-functions-for-page-table-walker.patch arch-powerpc-mm-subpage-protc-use-walk_page_vma-instead-of-walk_page_range.patch pagewalk-remove-argument-hmask-from-hugetlb_entry.patch pagewalk-remove-argument-hmask-from-hugetlb_entry-fix.patch pagewalk-remove-argument-hmask-from-hugetlb_entry-fix-fix.patch mempolicy-apply-page-table-walker-on-queue_pages_range.patch mm-pagewalkc-move-pte-null-check.patch mm-prom-pid-clear_refs-avoid-split_huge_page.patch # mm-pagewalk-remove-pgd_entry-and-pud_entry.patch mm-pagewalk-replace-mm_walk-skip-with-more-general-mm_walk-control.patch madvise-cleanup-swapin_walk_pmd_entry.patch memcg-separate-mem_cgroup_move_charge_pte_range.patch memcg-separate-mem_cgroup_move_charge_pte_range-checkpatch-fixes.patch arch-powerpc-mm-subpage-protc-cleanup-subpage_walk_pmd_entry.patch mm-pagewalk-move-pmd_trans_huge_lock-from-callbacks-to-common-code.patch mm-pagewalk-move-pmd_trans_huge_lock-from-callbacks-to-common-code-checkpatch-fixes.patch mincore-apply-page-table-walker-on-do_mincore.patch mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff.patch apepars to have disappeared, I didn't check the rest. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751138AbaFLWHf (ORCPT ); Thu, 12 Jun 2014 18:07:35 -0400 Received: from mga02.intel.com ([134.134.136.20]:39041 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750814AbaFLWHd (ORCPT ); Thu, 12 Jun 2014 18:07:33 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.01,467,1400050800"; d="scan'208";a="556555724" Message-ID: <539A248F.2090306@intel.com> Date: Thu, 12 Jun 2014 15:07:11 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0 MIME-Version: 1.0 To: Naoya Horiguchi , linux-mm@kvack.org CC: Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 06/11] pagewalk: add size to struct mm_walk References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-7-git-send-email-n-horiguchi@ah.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/12/2014 02:48 PM, Naoya Horiguchi wrote: > This variable is helpful if we try to share the callback function between > multiple slots (for example between pte_entry() and pmd_entry()) as done > in later patches. smaps_pte() already does this: static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end, struct mm_walk *walk) ... unsigned long ptent_size = end - addr; Other than the hugetlb handler, can't we always imply the size from end-addr? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751867AbaFOUZx (ORCPT ); Sun, 15 Jun 2014 16:25:53 -0400 Received: from mail-pa0-f45.google.com ([209.85.220.45]:47842 "EHLO mail-pa0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750993AbaFOUZv (ORCPT ); Sun, 15 Jun 2014 16:25:51 -0400 Date: Sun, 15 Jun 2014 13:24:30 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Naoya Horiguchi cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 02/11] madvise: cleanup swapin_walk_pmd_entry() In-Reply-To: <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> Message-ID: References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 12 Jun 2014, Naoya Horiguchi wrote: > With the recent update on page table walker, we can use common code for > the walking more. Unlike many other users, this swapin_walk expects to > handle swap entries. As a result we should be careful about ptl locking. > Swapin operation, read_swap_cache_async(), could cause page reclaim, so > we can't keep holding ptl throughout this pte loop. > In order to properly handle ptl in pte_entry(), this patch adds two new > members on struct mm_walk. > > This cleanup is necessary to get to the final form of page table walker, > where we should do all caller's specific work on leaf entries (IOW, all > pmd_entry() should be used for trans_pmd.) > > Signed-off-by: Naoya Horiguchi > Cc: Hugh Dickins Sorry, I believe this (and probably several other of your conversions) is badly flawed. You have a pattern of doing pte_offset_map_lock() inside the page walker, then dropping and regetting map and lock inside your pte handler. But on, say, x86_32 with CONFIG_HIGHMEM, CONFIG_SMP and CONFIG_PREEMPT, you may be preempted then run on a different cpu while atomic kmap and lock are dropped: so that the pte pointer then used on return to walk_pte_range() will no longer correspond to the right mapping. Presumably that can be fixed by keeping the pte pointer in the mm_walk structure; but I'm not at all sure that's the right thing to do. I am not nearly so keen as you to reduce all these to per-pte callouts, which seem inefficient to me. It can be argued both ways on the less important functions (like this madvise one); but I hope you don't try to make this kind of conversion to fast paths like those in memory.c. Hugh > --- > include/linux/mm.h | 4 ++++ > mm/madvise.c | 54 +++++++++++++++++++++++------------------------------- > mm/pagewalk.c | 11 +++++------ > 3 files changed, 32 insertions(+), 37 deletions(-) > > diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h > index b4aa6579f2b1..aa832161a1ff 100644 > --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h > +++ mmotm-2014-05-21-16-57/include/linux/mm.h > @@ -1108,6 +1108,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, > * @vma: vma currently walked > * @skip: internal control flag which is set when we skip the lower > * level entries. > + * @pmd: current pmd entry > + * @ptl: page table lock associated with current entry > * @private: private data for callbacks' use > * > * (see the comment on walk_page_range() for more details) > @@ -1126,6 +1128,8 @@ struct mm_walk { > struct mm_struct *mm; > struct vm_area_struct *vma; > int skip; > + pmd_t *pmd; > + spinlock_t *ptl; > void *private; > }; > > diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c > index a402f8fdc68e..06b390a6fbbd 100644 > --- mmotm-2014-05-21-16-57.orig/mm/madvise.c > +++ mmotm-2014-05-21-16-57/mm/madvise.c > @@ -135,38 +135,31 @@ static long madvise_behavior(struct vm_area_struct *vma, > } > > #ifdef CONFIG_SWAP > -static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, > +/* > + * Assuming that page table walker holds page table lock. > + */ > +static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, > unsigned long end, struct mm_walk *walk) > { > - pte_t *orig_pte; > - struct vm_area_struct *vma = walk->private; > - unsigned long index; > - > - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) > - return 0; > - > - for (index = start; index != end; index += PAGE_SIZE) { > - pte_t pte; > - swp_entry_t entry; > - struct page *page; > - spinlock_t *ptl; > - > - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); > - pte_unmap_unlock(orig_pte, ptl); > - > - if (pte_present(pte) || pte_none(pte) || pte_file(pte)) > - continue; > - entry = pte_to_swp_entry(pte); > - if (unlikely(non_swap_entry(entry))) > - continue; > - > - page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > - vma, index); > - if (page) > - page_cache_release(page); > - } > + pte_t ptent; > + pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); > + swp_entry_t entry; > + struct page *page; > > + ptent = *pte; > + pte_unmap_unlock(orig_pte, walk->ptl); > + if (pte_present(ptent) || pte_none(ptent) || pte_file(ptent)) > + goto lock; > + entry = pte_to_swp_entry(ptent); > + if (unlikely(non_swap_entry(entry))) > + goto lock; > + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > + walk->vma, start); > + if (page) > + page_cache_release(page); > +lock: > + pte_offset_map(walk->pmd, start & PMD_MASK); > + spin_lock(walk->ptl); > return 0; > } > > @@ -175,8 +168,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma, > { > struct mm_walk walk = { > .mm = vma->vm_mm, > - .pmd_entry = swapin_walk_pmd_entry, > - .private = vma, > + .pte_entry = swapin_walk_pte_entry, > }; > > walk_page_range(start, end, &walk); > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > index e734f63276c2..24311d6f5c20 100644 > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > @@ -27,10 +27,10 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > struct mm_struct *mm = walk->mm; > pte_t *pte; > pte_t *orig_pte; > - spinlock_t *ptl; > int err = 0; > > - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > + walk->pmd = pmd; > + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); > do { > if (pte_none(*pte)) { > if (walk->pte_hole) > @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > if (err) > break; > } while (pte++, addr += PAGE_SIZE, addr < end); > - pte_unmap_unlock(orig_pte, ptl); > + pte_unmap_unlock(orig_pte, walk->ptl); > cond_resched(); > return addr == end ? 0 : err; > } > @@ -172,7 +172,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > unsigned long hmask = huge_page_mask(h); > pte_t *pte; > int err = 0; > - spinlock_t *ptl; > > do { > next = hugetlb_entry_end(h, addr, end); > @@ -186,14 +185,14 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > break; > continue; > } > - ptl = huge_pte_lock(h, mm, pte); > + walk->ptl = huge_pte_lock(h, mm, pte); > /* > * Callers should have their own way to handle swap entries > * in walk->hugetlb_entry(). > */ > if (walk->hugetlb_entry) > err = walk->hugetlb_entry(pte, addr, next, walk); > - spin_unlock(ptl); > + spin_unlock(walk->ptl); > if (err) > break; > } while (addr = next, addr != end); > -- > 1.9.3 > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755482AbaFPQAF (ORCPT ); Mon, 16 Jun 2014 12:00:05 -0400 Received: from mx1.redhat.com ([209.132.183.28]:7768 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751448AbaFPQAE (ORCPT ); Mon, 16 Jun 2014 12:00:04 -0400 Date: Mon, 16 Jun 2014 11:59:50 -0400 From: Naoya Horiguchi To: Hugh Dickins Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 02/11] madvise: cleanup swapin_walk_pmd_entry() Message-ID: <20140616155950.GA13264@nhori.bos.redhat.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-3-git-send-email-n-horiguchi@ah.jp.nec.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Jun 15, 2014 at 01:24:30PM -0700, Hugh Dickins wrote: > On Thu, 12 Jun 2014, Naoya Horiguchi wrote: > > > With the recent update on page table walker, we can use common code for > > the walking more. Unlike many other users, this swapin_walk expects to > > handle swap entries. As a result we should be careful about ptl locking. > > Swapin operation, read_swap_cache_async(), could cause page reclaim, so > > we can't keep holding ptl throughout this pte loop. > > In order to properly handle ptl in pte_entry(), this patch adds two new > > members on struct mm_walk. > > > > This cleanup is necessary to get to the final form of page table walker, > > where we should do all caller's specific work on leaf entries (IOW, all > > pmd_entry() should be used for trans_pmd.) > > > > Signed-off-by: Naoya Horiguchi > > Cc: Hugh Dickins > > Sorry, I believe this (and probably several other of your conversions) > is badly flawed. > > You have a pattern of doing pte_offset_map_lock() inside the page walker, > then dropping and regetting map and lock inside your pte handler. > > But on, say, x86_32 with CONFIG_HIGHMEM, CONFIG_SMP and CONFIG_PREEMPT, > you may be preempted then run on a different cpu while atomic kmap and > lock are dropped: so that the pte pointer then used on return to > walk_pte_range() will no longer correspond to the right mapping. Sorry, I didn't handle it correctly. > Presumably that can be fixed by keeping the pte pointer in the mm_walk > structure; but I'm not at all sure that's the right thing to do. orig_pte should be updated if we call pte_offset_map_lock() or pte_offset_map() inside the callback. So moving orig_pte into mm_walk is what I think I'll do in the next post. > I am not nearly so keen as you to reduce all these to per-pte callouts, > which seem inefficient to me. Right, we can't do inlining for callbacks, so calling callbacks more is slower than open code. To make it better, code generator which creates open page walk code for each users at build time might be one option. Standardization done in patchset is the first step for doing it. > It can be argued both ways on the less > important functions (like this madvise one); but I hope you don't try > to make this kind of conversion to fast paths like those in memory.c. OK. I never touch fast paths until performance concern is solved. Thanks, Naoya Horiguchi > Hugh > > > --- > > include/linux/mm.h | 4 ++++ > > mm/madvise.c | 54 +++++++++++++++++++++++------------------------------- > > mm/pagewalk.c | 11 +++++------ > > 3 files changed, 32 insertions(+), 37 deletions(-) > > > > diff --git mmotm-2014-05-21-16-57.orig/include/linux/mm.h mmotm-2014-05-21-16-57/include/linux/mm.h > > index b4aa6579f2b1..aa832161a1ff 100644 > > --- mmotm-2014-05-21-16-57.orig/include/linux/mm.h > > +++ mmotm-2014-05-21-16-57/include/linux/mm.h > > @@ -1108,6 +1108,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, > > * @vma: vma currently walked > > * @skip: internal control flag which is set when we skip the lower > > * level entries. > > + * @pmd: current pmd entry > > + * @ptl: page table lock associated with current entry > > * @private: private data for callbacks' use > > * > > * (see the comment on walk_page_range() for more details) > > @@ -1126,6 +1128,8 @@ struct mm_walk { > > struct mm_struct *mm; > > struct vm_area_struct *vma; > > int skip; > > + pmd_t *pmd; > > + spinlock_t *ptl; > > void *private; > > }; > > > > diff --git mmotm-2014-05-21-16-57.orig/mm/madvise.c mmotm-2014-05-21-16-57/mm/madvise.c > > index a402f8fdc68e..06b390a6fbbd 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/madvise.c > > +++ mmotm-2014-05-21-16-57/mm/madvise.c > > @@ -135,38 +135,31 @@ static long madvise_behavior(struct vm_area_struct *vma, > > } > > > > #ifdef CONFIG_SWAP > > -static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, > > +/* > > + * Assuming that page table walker holds page table lock. > > + */ > > +static int swapin_walk_pte_entry(pte_t *pte, unsigned long start, > > unsigned long end, struct mm_walk *walk) > > { > > - pte_t *orig_pte; > > - struct vm_area_struct *vma = walk->private; > > - unsigned long index; > > - > > - if (pmd_none_or_trans_huge_or_clear_bad(pmd)) > > - return 0; > > - > > - for (index = start; index != end; index += PAGE_SIZE) { > > - pte_t pte; > > - swp_entry_t entry; > > - struct page *page; > > - spinlock_t *ptl; > > - > > - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); > > - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); > > - pte_unmap_unlock(orig_pte, ptl); > > - > > - if (pte_present(pte) || pte_none(pte) || pte_file(pte)) > > - continue; > > - entry = pte_to_swp_entry(pte); > > - if (unlikely(non_swap_entry(entry))) > > - continue; > > - > > - page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > > - vma, index); > > - if (page) > > - page_cache_release(page); > > - } > > + pte_t ptent; > > + pte_t *orig_pte = pte - ((start & (PMD_SIZE - 1)) >> PAGE_SHIFT); > > + swp_entry_t entry; > > + struct page *page; > > > > + ptent = *pte; > > + pte_unmap_unlock(orig_pte, walk->ptl); > > + if (pte_present(ptent) || pte_none(ptent) || pte_file(ptent)) > > + goto lock; > > + entry = pte_to_swp_entry(ptent); > > + if (unlikely(non_swap_entry(entry))) > > + goto lock; > > + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, > > + walk->vma, start); > > + if (page) > > + page_cache_release(page); > > +lock: > > + pte_offset_map(walk->pmd, start & PMD_MASK); > > + spin_lock(walk->ptl); > > return 0; > > } > > > > @@ -175,8 +168,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma, > > { > > struct mm_walk walk = { > > .mm = vma->vm_mm, > > - .pmd_entry = swapin_walk_pmd_entry, > > - .private = vma, > > + .pte_entry = swapin_walk_pte_entry, > > }; > > > > walk_page_range(start, end, &walk); > > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > > index e734f63276c2..24311d6f5c20 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > > @@ -27,10 +27,10 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > > struct mm_struct *mm = walk->mm; > > pte_t *pte; > > pte_t *orig_pte; > > - spinlock_t *ptl; > > int err = 0; > > > > - orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > > + walk->pmd = pmd; > > + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &walk->ptl); > > do { > > if (pte_none(*pte)) { > > if (walk->pte_hole) > > @@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, > > if (err) > > break; > > } while (pte++, addr += PAGE_SIZE, addr < end); > > - pte_unmap_unlock(orig_pte, ptl); > > + pte_unmap_unlock(orig_pte, walk->ptl); > > cond_resched(); > > return addr == end ? 0 : err; > > } > > @@ -172,7 +172,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > unsigned long hmask = huge_page_mask(h); > > pte_t *pte; > > int err = 0; > > - spinlock_t *ptl; > > > > do { > > next = hugetlb_entry_end(h, addr, end); > > @@ -186,14 +185,14 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end, > > break; > > continue; > > } > > - ptl = huge_pte_lock(h, mm, pte); > > + walk->ptl = huge_pte_lock(h, mm, pte); > > /* > > * Callers should have their own way to handle swap entries > > * in walk->hugetlb_entry(). > > */ > > if (walk->hugetlb_entry) > > err = walk->hugetlb_entry(pte, addr, next, walk); > > - spin_unlock(ptl); > > + spin_unlock(walk->ptl); > > if (err) > > break; > > } while (addr = next, addr != end); > > -- > > 1.9.3 > > > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933165AbaFQO2M (ORCPT ); Tue, 17 Jun 2014 10:28:12 -0400 Received: from mx1.redhat.com ([209.132.183.28]:43692 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932985AbaFQO2J (ORCPT ); Tue, 17 Jun 2014 10:28:09 -0400 Message-ID: <53A0506C.6040609@redhat.com> Date: Tue, 17 Jun 2014 16:27:56 +0200 From: Jerome Marchand User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0 MIME-Version: 1.0 To: Naoya Horiguchi , linux-mm@kvack.org CC: Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> In-Reply-To: <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: > Now all of current users of page table walker are canonicalized, i.e. > pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. > So we can factorize common code more. > This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. > > ChangeLog v2: > - add null check walk->vma in walk_pmd_range() An older version of this patch already made it to linux-next (commit b0e08c5) and I've actually hit the NULL pointer dereference. Moreover, that patch (or maybe another recent pagewalk patch) breaks /proc//smaps. All fields that should have been filled by smaps_pte() are almost always zero (and when it isn't, it's always a multiple of 2MB). It seems to me that the page walk never goes below pmd level. Jerome > - move comment update into a separate patch > > Signed-off-by: Naoya Horiguchi > --- > arch/powerpc/mm/subpage-prot.c | 2 ++ > fs/proc/task_mmu.c | 66 ++++++++++++++---------------------------- > mm/memcontrol.c | 55 ++++++++++------------------------- > mm/pagewalk.c | 18 ++++++++++-- > 4 files changed, 55 insertions(+), 86 deletions(-) > > diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > index fa9fb5b4c66c..d0d94ac606f3 100644 > --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c > +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > @@ -135,7 +135,9 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, > unsigned long end, struct mm_walk *walk) > { > struct vm_area_struct *vma = walk->vma; > + spin_unlock(walk->ptl); > split_huge_page_pmd(vma, addr, pmd); > + spin_lock(walk->ptl); > return 0; > } > > diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > index fa6d6a4e85b3..059206ea3c6b 100644 > --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c > +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > @@ -496,15 +496,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > struct mm_walk *walk) > { > struct mem_size_stats *mss = walk->private; > - spinlock_t *ptl; > - > - if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) { > - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > - spin_unlock(ptl); > - mss->anonymous_thp += HPAGE_PMD_SIZE; > - /* don't call smaps_pte() */ > - walk->skip = 1; > - } > + smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > + mss->anonymous_thp += HPAGE_PMD_SIZE; > return 0; > } > > @@ -983,31 +976,21 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > struct vm_area_struct *vma = walk->vma; > struct pagemapread *pm = walk->private; > pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); > - spinlock_t *ptl; > - > - if (!vma) > - return err; > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - int pmd_flags2; > + int pmd_flags2; > > - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > - pmd_flags2 = __PM_SOFT_DIRTY; > - else > - pmd_flags2 = 0; > + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > + pmd_flags2 = __PM_SOFT_DIRTY; > + else > + pmd_flags2 = 0; > > - for (; addr != end; addr += PAGE_SIZE) { > - unsigned long offset; > + for (; addr != end; addr += PAGE_SIZE) { > + unsigned long offset; > > - offset = (addr & ~PAGEMAP_WALK_MASK) >> > - PAGE_SHIFT; > - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > - err = add_to_pagemap(addr, &pme, pm); > - if (err) > - break; > - } > - spin_unlock(ptl); > - /* don't call pagemap_pte() */ > - walk->skip = 1; > + offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; > + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > + err = add_to_pagemap(addr, &pme, pm); > + if (err) > + break; > } > return err; > } > @@ -1277,20 +1260,13 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, > { > struct numa_maps *md = walk->private; > struct vm_area_struct *vma = walk->vma; > - spinlock_t *ptl; > - > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - pte_t huge_pte = *(pte_t *)pmd; > - struct page *page; > - > - page = can_gather_numa_stats(huge_pte, vma, addr); > - if (page) > - gather_stats(page, md, pte_dirty(huge_pte), > - HPAGE_PMD_SIZE/PAGE_SIZE); > - spin_unlock(ptl); > - /* don't call gather_pte_stats() */ > - walk->skip = 1; > - } > + pte_t huge_pte = *(pte_t *)pmd; > + struct page *page; > + > + page = can_gather_numa_stats(huge_pte, vma, addr); > + if (page) > + gather_stats(page, md, pte_dirty(huge_pte), > + HPAGE_PMD_SIZE/PAGE_SIZE); > return 0; > } > #ifdef CONFIG_HUGETLB_PAGE > diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c > index 01a66a208769..bb987cb9e043 100644 > --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c > +++ mmotm-2014-05-21-16-57/mm/memcontrol.c > @@ -6723,15 +6723,9 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, > struct mm_walk *walk) > { > struct vm_area_struct *vma = walk->vma; > - spinlock_t *ptl; > - > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > - mc.precharge += HPAGE_PMD_NR; > - spin_unlock(ptl); > - /* don't call mem_cgroup_count_precharge_pte() */ > - walk->skip = 1; > - } > + > + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > + mc.precharge += HPAGE_PMD_NR; > return 0; > } > > @@ -6952,38 +6946,21 @@ static int mem_cgroup_move_charge_pmd(pmd_t *pmd, > struct page *page; > struct page_cgroup *pc; > > - /* > - * We don't take compound_lock() here but no race with splitting thp > - * happens because: > - * - if pmd_trans_huge_lock() returns 1, the relevant thp is not > - * under splitting, which means there's no concurrent thp split, > - * - if another thread runs into split_huge_page() just after we > - * entered this if-block, the thread must wait for page table lock > - * to be unlocked in __split_huge_page_splitting(), where the main > - * part of thp split is not executed yet. > - */ > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > - if (mc.precharge < HPAGE_PMD_NR) { > - spin_unlock(ptl); > - return 0; > - } > - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > - if (target_type == MC_TARGET_PAGE) { > - page = target.page; > - if (!isolate_lru_page(page)) { > - pc = lookup_page_cgroup(page); > - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > - pc, mc.from, mc.to)) { > - mc.precharge -= HPAGE_PMD_NR; > - mc.moved_charge += HPAGE_PMD_NR; > - } > - putback_lru_page(page); > + if (mc.precharge < HPAGE_PMD_NR) > + return 0; > + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > + if (target_type == MC_TARGET_PAGE) { > + page = target.page; > + if (!isolate_lru_page(page)) { > + pc = lookup_page_cgroup(page); > + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > + pc, mc.from, mc.to)) { > + mc.precharge -= HPAGE_PMD_NR; > + mc.moved_charge += HPAGE_PMD_NR; > } > - put_page(page); > + putback_lru_page(page); > } > - spin_unlock(ptl); > - /* don't call mem_cgroup_move_charge_pte() */ > - walk->skip = 1; > + put_page(page); > } > return 0; > } > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > index 24311d6f5c20..f1a3417d0b51 100644 > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, > continue; > } > > - if (walk->pmd_entry) { > - err = walk->pmd_entry(pmd, addr, next, walk); > + /* > + * We don't take compound_lock() here but no race with splitting > + * thp happens because: > + * - if pmd_trans_huge_lock() returns 1, the relevant thp is > + * not under splitting, which means there's no concurrent > + * thp split, > + * - if another thread runs into split_huge_page() just after > + * we entered this if-block, the thread must wait for page > + * table lock to be unlocked in __split_huge_page_splitting(), > + * where the main part of thp split is not executed yet. > + */ > + if (walk->pmd_entry && walk->vma) { > + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { > + err = walk->pmd_entry(pmd, addr, next, walk); > + spin_unlock(walk->ptl); > + } > if (skip_lower_level_walking(walk)) > continue; > if (err) > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933468AbaFQPC0 (ORCPT ); Tue, 17 Jun 2014 11:02:26 -0400 Received: from mx1.redhat.com ([209.132.183.28]:21691 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933047AbaFQPCZ (ORCPT ); Tue, 17 Jun 2014 11:02:25 -0400 Date: Tue, 17 Jun 2014 11:01:59 -0400 From: Naoya Horiguchi To: Jerome Marchand Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code Message-ID: <20140617150159.GA8524@nhori.redhat.com> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> <53A0506C.6040609@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53A0506C.6040609@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 17, 2014 at 04:27:56PM +0200, Jerome Marchand wrote: > On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: > > Now all of current users of page table walker are canonicalized, i.e. > > pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. > > So we can factorize common code more. > > This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. > > > > ChangeLog v2: > > - add null check walk->vma in walk_pmd_range() > > An older version of this patch already made it to linux-next (commit > b0e08c5) and I've actually hit the NULL pointer dereference. > > Moreover, that patch (or maybe another recent pagewalk patch) breaks > /proc//smaps. All fields that should have been filled by > smaps_pte() are almost always zero (and when it isn't, it's always a > multiple of 2MB). It seems to me that the page walk never goes below > pmd level. Agreed, I'm now thinking that forcing pte_entry() for every user is not good idea, so I'll return to the start point and just will do only the necessary changes (i.e. only iron out the vma handling problem for hugepage.) Thanks, Naoya Horiguchi > Jerome > > > - move comment update into a separate patch > > > > Signed-off-by: Naoya Horiguchi > > --- > > arch/powerpc/mm/subpage-prot.c | 2 ++ > > fs/proc/task_mmu.c | 66 ++++++++++++++---------------------------- > > mm/memcontrol.c | 55 ++++++++++------------------------- > > mm/pagewalk.c | 18 ++++++++++-- > > 4 files changed, 55 insertions(+), 86 deletions(-) > > > > diff --git mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > > index fa9fb5b4c66c..d0d94ac606f3 100644 > > --- mmotm-2014-05-21-16-57.orig/arch/powerpc/mm/subpage-prot.c > > +++ mmotm-2014-05-21-16-57/arch/powerpc/mm/subpage-prot.c > > @@ -135,7 +135,9 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr, > > unsigned long end, struct mm_walk *walk) > > { > > struct vm_area_struct *vma = walk->vma; > > + spin_unlock(walk->ptl); > > split_huge_page_pmd(vma, addr, pmd); > > + spin_lock(walk->ptl); > > return 0; > > } > > > > diff --git mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > > index fa6d6a4e85b3..059206ea3c6b 100644 > > --- mmotm-2014-05-21-16-57.orig/fs/proc/task_mmu.c > > +++ mmotm-2014-05-21-16-57/fs/proc/task_mmu.c > > @@ -496,15 +496,8 @@ static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > > struct mm_walk *walk) > > { > > struct mem_size_stats *mss = walk->private; > > - spinlock_t *ptl; > > - > > - if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) { > > - smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > > - spin_unlock(ptl); > > - mss->anonymous_thp += HPAGE_PMD_SIZE; > > - /* don't call smaps_pte() */ > > - walk->skip = 1; > > - } > > + smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk); > > + mss->anonymous_thp += HPAGE_PMD_SIZE; > > return 0; > > } > > > > @@ -983,31 +976,21 @@ static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, > > struct vm_area_struct *vma = walk->vma; > > struct pagemapread *pm = walk->private; > > pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); > > - spinlock_t *ptl; > > - > > - if (!vma) > > - return err; > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - int pmd_flags2; > > + int pmd_flags2; > > > > - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > > - pmd_flags2 = __PM_SOFT_DIRTY; > > - else > > - pmd_flags2 = 0; > > + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) > > + pmd_flags2 = __PM_SOFT_DIRTY; > > + else > > + pmd_flags2 = 0; > > > > - for (; addr != end; addr += PAGE_SIZE) { > > - unsigned long offset; > > + for (; addr != end; addr += PAGE_SIZE) { > > + unsigned long offset; > > > > - offset = (addr & ~PAGEMAP_WALK_MASK) >> > > - PAGE_SHIFT; > > - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > > - err = add_to_pagemap(addr, &pme, pm); > > - if (err) > > - break; > > - } > > - spin_unlock(ptl); > > - /* don't call pagemap_pte() */ > > - walk->skip = 1; > > + offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT; > > + thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); > > + err = add_to_pagemap(addr, &pme, pm); > > + if (err) > > + break; > > } > > return err; > > } > > @@ -1277,20 +1260,13 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr, > > { > > struct numa_maps *md = walk->private; > > struct vm_area_struct *vma = walk->vma; > > - spinlock_t *ptl; > > - > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - pte_t huge_pte = *(pte_t *)pmd; > > - struct page *page; > > - > > - page = can_gather_numa_stats(huge_pte, vma, addr); > > - if (page) > > - gather_stats(page, md, pte_dirty(huge_pte), > > - HPAGE_PMD_SIZE/PAGE_SIZE); > > - spin_unlock(ptl); > > - /* don't call gather_pte_stats() */ > > - walk->skip = 1; > > - } > > + pte_t huge_pte = *(pte_t *)pmd; > > + struct page *page; > > + > > + page = can_gather_numa_stats(huge_pte, vma, addr); > > + if (page) > > + gather_stats(page, md, pte_dirty(huge_pte), > > + HPAGE_PMD_SIZE/PAGE_SIZE); > > return 0; > > } > > #ifdef CONFIG_HUGETLB_PAGE > > diff --git mmotm-2014-05-21-16-57.orig/mm/memcontrol.c mmotm-2014-05-21-16-57/mm/memcontrol.c > > index 01a66a208769..bb987cb9e043 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/memcontrol.c > > +++ mmotm-2014-05-21-16-57/mm/memcontrol.c > > @@ -6723,15 +6723,9 @@ static int mem_cgroup_count_precharge_pmd(pmd_t *pmd, > > struct mm_walk *walk) > > { > > struct vm_area_struct *vma = walk->vma; > > - spinlock_t *ptl; > > - > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > > - mc.precharge += HPAGE_PMD_NR; > > - spin_unlock(ptl); > > - /* don't call mem_cgroup_count_precharge_pte() */ > > - walk->skip = 1; > > - } > > + > > + if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE) > > + mc.precharge += HPAGE_PMD_NR; > > return 0; > > } > > > > @@ -6952,38 +6946,21 @@ static int mem_cgroup_move_charge_pmd(pmd_t *pmd, > > struct page *page; > > struct page_cgroup *pc; > > > > - /* > > - * We don't take compound_lock() here but no race with splitting thp > > - * happens because: > > - * - if pmd_trans_huge_lock() returns 1, the relevant thp is not > > - * under splitting, which means there's no concurrent thp split, > > - * - if another thread runs into split_huge_page() just after we > > - * entered this if-block, the thread must wait for page table lock > > - * to be unlocked in __split_huge_page_splitting(), where the main > > - * part of thp split is not executed yet. > > - */ > > - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { > > - if (mc.precharge < HPAGE_PMD_NR) { > > - spin_unlock(ptl); > > - return 0; > > - } > > - target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > > - if (target_type == MC_TARGET_PAGE) { > > - page = target.page; > > - if (!isolate_lru_page(page)) { > > - pc = lookup_page_cgroup(page); > > - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > > - pc, mc.from, mc.to)) { > > - mc.precharge -= HPAGE_PMD_NR; > > - mc.moved_charge += HPAGE_PMD_NR; > > - } > > - putback_lru_page(page); > > + if (mc.precharge < HPAGE_PMD_NR) > > + return 0; > > + target_type = get_mctgt_type_thp(vma, addr, *pmd, &target); > > + if (target_type == MC_TARGET_PAGE) { > > + page = target.page; > > + if (!isolate_lru_page(page)) { > > + pc = lookup_page_cgroup(page); > > + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > > + pc, mc.from, mc.to)) { > > + mc.precharge -= HPAGE_PMD_NR; > > + mc.moved_charge += HPAGE_PMD_NR; > > } > > - put_page(page); > > + putback_lru_page(page); > > } > > - spin_unlock(ptl); > > - /* don't call mem_cgroup_move_charge_pte() */ > > - walk->skip = 1; > > + put_page(page); > > } > > return 0; > > } > > diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > > index 24311d6f5c20..f1a3417d0b51 100644 > > --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > > +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > > @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, > > continue; > > } > > > > - if (walk->pmd_entry) { > > - err = walk->pmd_entry(pmd, addr, next, walk); > > + /* > > + * We don't take compound_lock() here but no race with splitting > > + * thp happens because: > > + * - if pmd_trans_huge_lock() returns 1, the relevant thp is > > + * not under splitting, which means there's no concurrent > > + * thp split, > > + * - if another thread runs into split_huge_page() just after > > + * we entered this if-block, the thread must wait for page > > + * table lock to be unlocked in __split_huge_page_splitting(), > > + * where the main part of thp split is not executed yet. > > + */ > > + if (walk->pmd_entry && walk->vma) { > > + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { > > + err = walk->pmd_entry(pmd, addr, next, walk); > > + spin_unlock(walk->ptl); > > + } > > if (skip_lower_level_walking(walk)) > > continue; > > if (err) > > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752390AbaFRPOL (ORCPT ); Wed, 18 Jun 2014 11:14:11 -0400 Received: from mx1.redhat.com ([209.132.183.28]:29039 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751145AbaFRPOJ (ORCPT ); Wed, 18 Jun 2014 11:14:09 -0400 Message-ID: <53A1ACB1.3050102@redhat.com> Date: Wed, 18 Jun 2014 17:13:53 +0200 From: Jerome Marchand User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0 MIME-Version: 1.0 To: Naoya Horiguchi CC: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> <53A0506C.6040609@redhat.com> <20140617150159.GA8524@nhori.redhat.com> In-Reply-To: <20140617150159.GA8524@nhori.redhat.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/17/2014 05:01 PM, Naoya Horiguchi wrote: > On Tue, Jun 17, 2014 at 04:27:56PM +0200, Jerome Marchand wrote: >> On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: >>> Now all of current users of page table walker are canonicalized, i.e. >>> pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. >>> So we can factorize common code more. >>> This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. >>> >>> ChangeLog v2: >>> - add null check walk->vma in walk_pmd_range() >> >> An older version of this patch already made it to linux-next (commit >> b0e08c5) and I've actually hit the NULL pointer dereference. >> >> Moreover, that patch (or maybe another recent pagewalk patch) breaks >> /proc//smaps. All fields that should have been filled by >> smaps_pte() are almost always zero (and when it isn't, it's always a >> multiple of 2MB). It seems to me that the page walk never goes below >> pmd level. > > Agreed, I'm now thinking that forcing pte_entry() for every user is not > good idea, so I'll return to the start point and just will do only the > necessary changes (i.e. only iron out the vma handling problem for hugepage.) > > Thanks, > Naoya Horiguchi > >> Jerome >> >>> - move comment update into a separate patch >>> >>> Signed-off-by: Naoya Horiguchi >>> --- >>> diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c >>> index 24311d6f5c20..f1a3417d0b51 100644 >>> --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c >>> +++ mmotm-2014-05-21-16-57/mm/pagewalk.c >>> @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, >>> continue; >>> } >>> >>> - if (walk->pmd_entry) { >>> - err = walk->pmd_entry(pmd, addr, next, walk); >>> + /* >>> + * We don't take compound_lock() here but no race with splitting >>> + * thp happens because: >>> + * - if pmd_trans_huge_lock() returns 1, the relevant thp is >>> + * not under splitting, which means there's no concurrent >>> + * thp split, >>> + * - if another thread runs into split_huge_page() just after >>> + * we entered this if-block, the thread must wait for page >>> + * table lock to be unlocked in __split_huge_page_splitting(), >>> + * where the main part of thp split is not executed yet. >>> + */ >>> + if (walk->pmd_entry && walk->vma) { >>> + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { >>> + err = walk->pmd_entry(pmd, addr, next, walk); >>> + spin_unlock(walk->ptl); >>> + } >>> if (skip_lower_level_walking(walk)) >>> continue; >>> if (err) This is the cause of the smaps trouble. This code modifies walk->control when pmd_entry() is present, even when it is not called. All the control code should depend on pmd_trans_huge_lock() == 1 too. Jerome From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752927AbaFRPb7 (ORCPT ); Wed, 18 Jun 2014 11:31:59 -0400 Received: from mx1.redhat.com ([209.132.183.28]:56655 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751829AbaFRPb6 (ORCPT ); Wed, 18 Jun 2014 11:31:58 -0400 Date: Wed, 18 Jun 2014 11:31:45 -0400 From: Naoya Horiguchi To: Jerome Marchand Cc: linux-mm@kvack.org, Dave Hansen , Andrew Morton , Hugh Dickins , "Kirill A. Shutemov" , linux-kernel@vger.kernel.org Subject: Re: [PATCH -mm v2 04/11] pagewalk: move pmd_trans_huge_lock() from callbacks to common code Message-ID: <20140618153145.GA26866@nhori> References: <1402609691-13950-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1402609691-13950-5-git-send-email-n-horiguchi@ah.jp.nec.com> <53A0506C.6040609@redhat.com> <20140617150159.GA8524@nhori.redhat.com> <53A1ACB1.3050102@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <53A1ACB1.3050102@redhat.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 18, 2014 at 05:13:53PM +0200, Jerome Marchand wrote: > On 06/17/2014 05:01 PM, Naoya Horiguchi wrote: > > On Tue, Jun 17, 2014 at 04:27:56PM +0200, Jerome Marchand wrote: > >> On 06/12/2014 11:48 PM, Naoya Horiguchi wrote: > >>> Now all of current users of page table walker are canonicalized, i.e. > >>> pmd_entry() handles only trans_pmd entry, and pte_entry() handles pte entry. > >>> So we can factorize common code more. > >>> This patch moves pmd_trans_huge_lock() in each pmd_entry() to pagewalk core. > >>> > >>> ChangeLog v2: > >>> - add null check walk->vma in walk_pmd_range() > >> > >> An older version of this patch already made it to linux-next (commit > >> b0e08c5) and I've actually hit the NULL pointer dereference. > >> > >> Moreover, that patch (or maybe another recent pagewalk patch) breaks > >> /proc//smaps. All fields that should have been filled by > >> smaps_pte() are almost always zero (and when it isn't, it's always a > >> multiple of 2MB). It seems to me that the page walk never goes below > >> pmd level. > > > > Agreed, I'm now thinking that forcing pte_entry() for every user is not > > good idea, so I'll return to the start point and just will do only the > > necessary changes (i.e. only iron out the vma handling problem for hugepage.) > > > > Thanks, > > Naoya Horiguchi > > > >> Jerome > >> > >>> - move comment update into a separate patch > >>> > >>> Signed-off-by: Naoya Horiguchi > >>> --- > > > >>> diff --git mmotm-2014-05-21-16-57.orig/mm/pagewalk.c mmotm-2014-05-21-16-57/mm/pagewalk.c > >>> index 24311d6f5c20..f1a3417d0b51 100644 > >>> --- mmotm-2014-05-21-16-57.orig/mm/pagewalk.c > >>> +++ mmotm-2014-05-21-16-57/mm/pagewalk.c > >>> @@ -73,8 +73,22 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, > >>> continue; > >>> } > >>> > >>> - if (walk->pmd_entry) { > >>> - err = walk->pmd_entry(pmd, addr, next, walk); > >>> + /* > >>> + * We don't take compound_lock() here but no race with splitting > >>> + * thp happens because: > >>> + * - if pmd_trans_huge_lock() returns 1, the relevant thp is > >>> + * not under splitting, which means there's no concurrent > >>> + * thp split, > >>> + * - if another thread runs into split_huge_page() just after > >>> + * we entered this if-block, the thread must wait for page > >>> + * table lock to be unlocked in __split_huge_page_splitting(), > >>> + * where the main part of thp split is not executed yet. > >>> + */ > >>> + if (walk->pmd_entry && walk->vma) { > >>> + if (pmd_trans_huge_lock(pmd, walk->vma, &walk->ptl) == 1) { > >>> + err = walk->pmd_entry(pmd, addr, next, walk); > >>> + spin_unlock(walk->ptl); > >>> + } > >>> if (skip_lower_level_walking(walk)) > >>> continue; > >>> if (err) > > This is the cause of the smaps trouble. This code modifies walk->control > when pmd_entry() is present, even when it is not called. All the control > code should depend on pmd_trans_huge_lock() == 1 too. Thank you for pointing out, I have a few objection about doing aggressive cleanup around this code, and now I'm preparing the next version which does minimum cleanup without walk->control stuff, so I hope your concern will be gone in it. Thanks, Naoya Horiguchi