From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yh0-f53.google.com (mail-yh0-f53.google.com [209.85.213.53]) by kanga.kvack.org (Postfix) with ESMTP id 5A5816B006E for ; Tue, 13 Jan 2015 14:14:33 -0500 (EST)
Received: by mail-yh0-f53.google.com with SMTP id i57so2390715yha.12 for ; Tue, 13 Jan 2015 11:14:33 -0800 (PST)
Received: from mga11.intel.com (mga11.intel.com. [192.55.52.93]) by mx.google.com with ESMTP id n132si11291398ykc.92.2015.01.13.11.14.32 for ; Tue, 13 Jan 2015 11:14:32 -0800 (PST)
From: "Kirill A. Shutemov"
Subject: [PATCH 0/2] Account PMD page tables to the process
Date: Tue, 13 Jan 2015 21:14:14 +0200
Message-Id: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Hugh Dickins
Cc: linux-mm@kvack.org, Dave Hansen, Cyrill Gorcunov, Pavel Emelyanov, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"

Currently we don't account PMD page tables to the process. This can lead to
a local DoS: an unprivileged user can allocate >500 MiB on x86_64 per process
without being noticed by the oom-killer or the memory cgroup.

The proposed fix adds accounting for PMD tables the same way we account for
PTE tables.

There are a few corner cases in the accounting (see patch 2/2) which have not
been well tested yet. If anybody knows of any other cases we should handle,
please let me know.

Kirill A. Shutemov (2):
  mm: rename mm->nr_ptes to mm->nr_pgtables
  mm: account pmd page tables to the process

 Documentation/sysctl/vm.txt |  2 +-
 arch/x86/mm/pgtable.c       | 13 ++++++++-----
 fs/proc/task_mmu.c          |  2 +-
 include/linux/mm_types.h    |  2 +-
 kernel/fork.c               |  2 +-
 mm/debug.c                  |  4 ++--
 mm/huge_memory.c            | 10 +++++-----
 mm/hugetlb.c                |  8 ++++++--
 mm/memory.c                 |  6 ++++--
 mm/mmap.c                   |  9 +++++++--
 mm/oom_kill.c               |  8 ++++----
 11 files changed, 40 insertions(+), 26 deletions(-)

--
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yh0-f48.google.com (mail-yh0-f48.google.com [209.85.213.48]) by kanga.kvack.org (Postfix) with ESMTP id 338E66B0070 for ; Tue, 13 Jan 2015 14:14:34 -0500 (EST)
Received: by mail-yh0-f48.google.com with SMTP id i57so2402085yha.7 for ; Tue, 13 Jan 2015 11:14:34 -0800 (PST)
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id x47si8838475yhd.120.2015.01.13.11.14.32 for ; Tue, 13 Jan 2015 11:14:32 -0800 (PST)
From: "Kirill A. Shutemov"
Subject: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables
Date: Tue, 13 Jan 2015 21:14:15 +0200
Message-Id: <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com>
In-Reply-To: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Hugh Dickins
Cc: linux-mm@kvack.org, Dave Hansen, Cyrill Gorcunov, Pavel Emelyanov, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"

We're going to account pmd page tables too. Let's rename mm->nr_ptes
to something more generic.

Signed-off-by: Kirill A.
Shutemov --- Documentation/sysctl/vm.txt | 2 +- fs/proc/task_mmu.c | 2 +- include/linux/mm_types.h | 2 +- kernel/fork.c | 2 +- mm/debug.c | 4 ++-- mm/huge_memory.c | 10 +++++----- mm/memory.c | 4 ++-- mm/mmap.c | 2 +- mm/oom_kill.c | 8 ++++---- 9 files changed, 18 insertions(+), 18 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 4415aa915681..0211a58ee6c3 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -557,7 +557,7 @@ oom_dump_tasks Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing and includes such -information as pid, uid, tgid, vm size, rss, nr_ptes, swapents, +information as pid, uid, tgid, vm size, rss, nr_pgtables, swapents, oom_score_adj score, and name. This is helpful to determine why the OOM killer was invoked, to identify the rogue task that caused it, and to determine why the OOM killer chose the task it did to kill. diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 8faae6fed085..6121cca220cb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE * sizeof(pte_t) * - atomic_long_read(&mm->nr_ptes)) >> 10, + atomic_long_read(&mm->nr_pgtables)) >> 10, swap << (PAGE_SHIFT-10)); } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 20ff2105b564..106be259a3ea 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -363,7 +363,7 @@ struct mm_struct { pgd_t * pgd; atomic_t mm_users; /* How many users with user space? 
*/ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ - atomic_long_t nr_ptes; /* Page table pages */ + atomic_long_t nr_pgtables; /* Page table pages */ int map_count; /* number of VMAs */ spinlock_t page_table_lock; /* Protects page tables and some counters */ diff --git a/kernel/fork.c b/kernel/fork.c index b379d9abddc7..d7d00cac1f66 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -554,7 +554,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; - atomic_long_set(&mm->nr_ptes, 0); + atomic_long_set(&mm->nr_pgtables, 0); mm->map_count = 0; mm->locked_vm = 0; mm->pinned_vm = 0; diff --git a/mm/debug.c b/mm/debug.c index d69cb5a7ba9a..229d3cba5677 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -173,7 +173,7 @@ void dump_mm(const struct mm_struct *mm) "get_unmapped_area %p\n" #endif "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" - "pgd %p mm_users %d mm_count %d nr_ptes %lu map_count %d\n" + "pgd %p mm_users %d mm_count %d nr_pgtables %lu map_count %d\n" "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" "pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n" "start_code %lx end_code %lx start_data %lx end_data %lx\n" @@ -205,7 +205,7 @@ void dump_mm(const struct mm_struct *mm) mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end, mm->pgd, atomic_read(&mm->mm_users), atomic_read(&mm->mm_count), - atomic_long_read((atomic_long_t *)&mm->nr_ptes), + atomic_long_read((atomic_long_t *)&mm->nr_pgtables), mm->map_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b29c48708b89..6ad66bc4da88 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -749,7 +749,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, pgtable_trans_huge_deposit(mm, pmd, pgtable); 
set_pmd_at(mm, haddr, pmd, entry); add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); - atomic_long_inc(&mm->nr_ptes); + atomic_long_inc(&mm->nr_pgtables); spin_unlock(ptl); } @@ -782,7 +782,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm, entry = pmd_mkhuge(entry); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, haddr, pmd, entry); - atomic_long_inc(&mm->nr_ptes); + atomic_long_inc(&mm->nr_pgtables); return true; } @@ -905,7 +905,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = pmd_mkold(pmd_wrprotect(pmd)); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); set_pmd_at(dst_mm, addr, dst_pmd, pmd); - atomic_long_inc(&dst_mm->nr_ptes); + atomic_long_inc(&dst_mm->nr_pgtables); ret = 0; out_unlock: @@ -1429,7 +1429,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, tlb_remove_pmd_tlb_entry(tlb, pmd, addr); pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd); if (is_huge_zero_pmd(orig_pmd)) { - atomic_long_dec(&tlb->mm->nr_ptes); + atomic_long_dec(&tlb->mm->nr_pgtables); spin_unlock(ptl); put_huge_zero_page(); } else { @@ -1438,7 +1438,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); VM_BUG_ON_PAGE(!PageHead(page), page); - atomic_long_dec(&tlb->mm->nr_ptes); + atomic_long_dec(&tlb->mm->nr_pgtables); spin_unlock(ptl); tlb_remove_page(tlb, page); } diff --git a/mm/memory.c b/mm/memory.c index 5afb6d89ac96..2f9ee3089c20 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -394,7 +394,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, pgtable_t token = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free_tlb(tlb, token, addr); - atomic_long_dec(&tlb->mm->nr_ptes); + atomic_long_dec(&tlb->mm->nr_pgtables); } static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, @@ -586,7 +586,7 @@ int __pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma, ptl = pmd_lock(mm, pmd); wait_split_huge_page = 0; if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ - atomic_long_inc(&mm->nr_ptes); + atomic_long_inc(&mm->nr_pgtables); pmd_populate(mm, pmd, new); new = NULL; } else if (unlikely(pmd_trans_splitting(*pmd))) diff --git a/mm/mmap.c b/mm/mmap.c index 14d84666e8ba..3c591112263d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2852,7 +2852,7 @@ void exit_mmap(struct mm_struct *mm) } vm_unacct_memory(nr_accounted); - WARN_ON(atomic_long_read(&mm->nr_ptes) > + WARN_ON(atomic_long_read(&mm->nr_pgtables) > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 294493a7ae4b..d121ee7fa357 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -169,7 +169,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ - points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) + + points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_pgtables) + get_mm_counter(p->mm, MM_SWAPENTS); task_unlock(p); @@ -345,7 +345,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, * Dumps the current memory state of all eligible tasks. Tasks not in the same * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes * are not shown. - * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes, + * State information includes task's pid, uid, tgid, vm size, rss, nr_pgtables, * swapents, oom_score_adj value, and name. 
*/ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) @@ -353,7 +353,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) struct task_struct *p; struct task_struct *task; - pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss nr_pgtables swapents oom_score_adj name\n"); rcu_read_lock(); for_each_process(p) { if (oom_unkillable_task(p, memcg, nodemask)) @@ -372,7 +372,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) pr_info("[%5d] %5d %5d %8lu %8lu %7ld %8lu %5hd %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - atomic_long_read(&task->mm->nr_ptes), + atomic_long_read(&task->mm->nr_pgtables), get_mm_counter(task->mm, MM_SWAPENTS), task->signal->oom_score_adj, task->comm); task_unlock(task); -- 2.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f181.google.com (mail-pd0-f181.google.com [209.85.192.181]) by kanga.kvack.org (Postfix) with ESMTP id 52AE56B0071 for ; Tue, 13 Jan 2015 14:15:04 -0500 (EST) Received: by mail-pd0-f181.google.com with SMTP id v10so4983524pde.12 for ; Tue, 13 Jan 2015 11:15:04 -0800 (PST) Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTP id ko6si28113155pab.77.2015.01.13.11.15.01 for ; Tue, 13 Jan 2015 11:15:02 -0800 (PST) From: "Kirill A. 
Shutemov"
Subject: [PATCH 2/2] mm: account pmd page tables to the process
Date: Tue, 13 Jan 2015 21:14:16 +0200
Message-Id: <1421176456-21796-3-git-send-email-kirill.shutemov@linux.intel.com>
In-Reply-To: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton, Hugh Dickins
Cc: linux-mm@kvack.org, Dave Hansen, Cyrill Gorcunov, Pavel Emelyanov, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"

Dave noticed that an unprivileged process can allocate a significant amount
of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and
the memory cgroup. The trick is to allocate a lot of PMD page tables. The
Linux kernel doesn't account PMD tables to the process, only PTE tables.

The test case below uses a few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. The oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		/* Disable THP so that touching a page faults in one 4 KiB page. */
		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		for (i = 0; i < NR_PUD ; i++) {
			/* Hint at the address one PUD_SIZE above the previous mapping. */
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			/* Touch the first page: populates a PMD table and a PTE table. */
			*addr = 'x';
			/* Unmap the touched range, then map it again untouched. */
			munmap(addr, PMD_SIZE);
			if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by accounting PMD tables to the process the
same way we account PTE tables.

The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:

- HugeTLB can share PMD page tables. The patch handles this by accounting the table to all processes that share it.
- x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Signed-off-by: Kirill A. Shutemov Reported-by: Dave Hansen --- arch/x86/mm/pgtable.c | 13 ++++++++----- mm/hugetlb.c | 8 ++++++-- mm/memory.c | 2 ++ mm/mmap.c | 9 +++++++-- 4 files changed, 23 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 6fb6927f9e76..7f7b7005a7d5 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -190,7 +190,7 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd) #endif /* CONFIG_X86_PAE */ -static void free_pmds(pmd_t *pmds[]) +static void free_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; @@ -198,10 +198,11 @@ static void free_pmds(pmd_t *pmds[]) if (pmds[i]) { pgtable_pmd_page_dtor(virt_to_page(pmds[i])); free_page((unsigned long)pmds[i]); + atomic_long_dec(&mm->nr_pgtables); } } -static int preallocate_pmds(pmd_t *pmds[]) +static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; bool failed = false; @@ -215,11 +216,13 @@ static int preallocate_pmds(pmd_t *pmds[]) pmd = NULL; failed = true; } + if (pmd) + atomic_long_inc(&mm->nr_pgtables); pmds[i] = pmd; } if (failed) { - free_pmds(pmds); + free_pmds(mm, pmds); return -ENOMEM; } @@ -283,7 +286,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) mm->pgd = pgd; - if (preallocate_pmds(pmds) != 0) + if (preallocate_pmds(mm, pmds) != 0) goto out_free_pgd; if (paravirt_pgd_alloc(mm) != 0) @@ -304,7 +307,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) return pgd; out_free_pmds: - free_pmds(pmds); + free_pmds(mm, pmds); out_free_pgd: free_page((unsigned long)pgd); out: diff --git a/mm/hugetlb.c b/mm/hugetlb.c index be0e5d0db5ec..edd1278b4965 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3558,6 +3558,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (saddr) { spte = huge_pte_offset(svma->vm_mm, saddr); if (spte) { + 
atomic_long_inc(&mm->nr_pgtables); get_page(virt_to_page(spte)); break; } @@ -3569,11 +3570,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) ptl = huge_pte_lockptr(hstate_vma(vma), mm, spte); spin_lock(ptl); - if (pud_none(*pud)) + if (pud_none(*pud)) { pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK)); - else + } else { put_page(virt_to_page(spte)); + atomic_long_dec(&mm->nr_pgtables); + } spin_unlock(ptl); out: pte = (pte_t *)pmd_alloc(mm, pud, addr); @@ -3604,6 +3607,7 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) pud_clear(pud); put_page(virt_to_page(ptep)); + atomic_long_dec(&mm->nr_pgtables); *addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE; return 1; } diff --git a/mm/memory.c b/mm/memory.c index 2f9ee3089c20..8b6f32e1c0b5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -428,6 +428,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, pmd = pmd_offset(pud, start); pud_clear(pud); pmd_free_tlb(tlb, pmd, start); + atomic_long_dec(&tlb->mm->nr_pgtables); } static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd, @@ -3321,6 +3322,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) smp_wmb(); /* See comment in __pte_alloc */ spin_lock(&mm->page_table_lock); + atomic_long_inc(&mm->nr_pgtables); #ifndef __ARCH_HAS_4LEVEL_HACK if (pud_present(*pud)) /* Another has populated it */ pmd_free(mm, new); diff --git a/mm/mmap.c b/mm/mmap.c index 3c591112263d..d8013a6b7ebd 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2812,6 +2812,7 @@ void exit_mmap(struct mm_struct *mm) struct mmu_gather tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; + unsigned long max_nr_pgtables; /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); @@ -2852,8 +2853,12 @@ void exit_mmap(struct mm_struct *mm) } vm_unacct_memory(nr_accounted); - WARN_ON(atomic_long_read(&mm->nr_pgtables) > - 
(FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); + max_nr_pgtables = round_up(FIRST_USER_ADDRESS, PMD_SIZE) >> PMD_SHIFT; + if (!IS_ENABLED(__PAGETABLE_PMD_FOLDED)) { + max_nr_pgtables += + round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT; + } + WARN_ON(atomic_long_read(&mm->nr_pgtables) > max_nr_pgtables); } /* Insert vm structure into process list sorted by address -- 2.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-vc0-f174.google.com (mail-vc0-f174.google.com [209.85.220.174]) by kanga.kvack.org (Postfix) with ESMTP id 10A9C6B0032 for ; Tue, 13 Jan 2015 15:36:33 -0500 (EST) Received: by mail-vc0-f174.google.com with SMTP id id10so1640813vcb.5 for ; Tue, 13 Jan 2015 12:36:32 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id p3si1519209vdx.87.2015.01.13.12.36.31 for ; Tue, 13 Jan 2015 12:36:31 -0800 (PST) Message-ID: <54B581C7.50206@linux.intel.com> Date: Tue, 13 Jan 2015 12:36:23 -0800 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> In-Reply-To: <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins Cc: linux-mm@kvack.org, Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org On 01/13/2015 11:14 AM, Kirill A. Shutemov wrote: > pgd_t * pgd; > atomic_t mm_users; /* How many users with user space? 
*/ > atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ > - atomic_long_t nr_ptes; /* Page table pages */ > + atomic_long_t nr_pgtables; /* Page table pages */ > int map_count; /* number of VMAs */ One more crazy idea... There are 2^9 possible pud pages, 2^18 pmd pages and 2^27 pte pages. That's only 54 bits (technically minus one bit each because the upper half of the address space is for the kernel). That's enough to actually account for pte, pmd and pud pages separately without increasing the size of the storage we need. You could even enforce that warning you were talking about at exit time for pte pages, but just ignore pmd mismatches so you don't have false warnings on hugetlbfs shared pmd pages. Or, even better, strictly track pmd page usage _unless_ hugetlbfs shared pmds are in play and track _that_ in another bit. On 32-bit PAE, that's 2 bits for PMD pages, and 11 for PTE pages, so it should fit in an atomic_long_t there too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id A0A5B6B006E for ; Tue, 13 Jan 2015 15:41:49 -0500 (EST) Received: by mail-wg0-f49.google.com with SMTP id n12so5240019wgh.8 for ; Tue, 13 Jan 2015 12:41:49 -0800 (PST) Received: from jenni2.inet.fi (mta-out1.inet.fi. [62.71.2.203]) by mx.google.com with ESMTP id t10si653946wif.19.2015.01.13.12.41.48 for ; Tue, 13 Jan 2015 12:41:49 -0800 (PST) Date: Tue, 13 Jan 2015 22:41:44 +0200 From: "Kirill A. 
Shutemov"
Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables
Message-ID: <20150113204144.GA1865@node.dhcp.inet.fi>
References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <54B581C7.50206@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <54B581C7.50206@linux.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Dave Hansen
Cc: "Kirill A. Shutemov", Andrew Morton, Hugh Dickins, linux-mm@kvack.org, Cyrill Gorcunov, Pavel Emelyanov, linux-kernel@vger.kernel.org

On Tue, Jan 13, 2015 at 12:36:23PM -0800, Dave Hansen wrote:
> On 01/13/2015 11:14 AM, Kirill A. Shutemov wrote:
> >  pgd_t * pgd;
> >  atomic_t mm_users; /* How many users with user space? */
> >  atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
> > - atomic_long_t nr_ptes; /* Page table pages */
> > + atomic_long_t nr_pgtables; /* Page table pages */
> >  int map_count; /* number of VMAs */
>
> One more crazy idea...
>
> There are 2^9 possible pud pages, 2^18 pmd pages and 2^27 pte pages.
> That's only 54 bits (technically minus one bit each because the upper
> half of the address space is for the kernel).

Does this math make sense for all architectures? IA64? Power?

--
 Kirill A. Shutemov

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id 53F766B0032 for ; Tue, 13 Jan 2015 16:35:26 -0500 (EST)
Received: by mail-pa0-f45.google.com with SMTP id lf10so6015200pab.4 for ; Tue, 13 Jan 2015 13:35:26 -0800 (PST)
Received: from mga01.intel.com (mga01.intel.com.
[192.55.52.88]) by mx.google.com with ESMTP id z9si28242558par.226.2015.01.13.13.35.24 for ; Tue, 13 Jan 2015 13:35:25 -0800 (PST) Message-ID: <54B58F9B.4050100@linux.intel.com> Date: Tue, 13 Jan 2015 13:35:23 -0800 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <54B581C7.50206@linux.intel.com> <20150113204144.GA1865@node.dhcp.inet.fi> In-Reply-To: <20150113204144.GA1865@node.dhcp.inet.fi> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org On 01/13/2015 12:41 PM, Kirill A. Shutemov wrote: > On Tue, Jan 13, 2015 at 12:36:23PM -0800, Dave Hansen wrote: >> On 01/13/2015 11:14 AM, Kirill A. Shutemov wrote: >>> pgd_t * pgd; >>> atomic_t mm_users; /* How many users with user space? */ >>> atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ >>> - atomic_long_t nr_ptes; /* Page table pages */ >>> + atomic_long_t nr_pgtables; /* Page table pages */ >>> int map_count; /* number of VMAs */ >> >> One more crazy idea... >> >> There are 2^9 possible pud pages, 2^18 pmd pages and 2^27 pte pages. >> That's only 54 bits (technically minus one bit each because the upper >> half of the address space is for the kernel). > > Does this math make sense for all architecures? IA64? Power? No, the sizes will be different on the other architectures. But, 4k pages with 64-bit ptes is as bad as it gets, I think. Larger page sizes mean fewer page tables on powerpc. So the values should at least _fit_ in a long. Maybe it's not even worth the trouble. 
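
For illustration only, the packing idea in this message can be sketched in userspace C. This is not code from the thread: the names `pgt_level`, `pgt_read` and `pgt_add` and the field order are hypothetical, and the widths are simply the x86_64 numbers quoted above (27 + 18 + 9 = 54 bits), so other architectures would need different values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout, not from the patch: three page-table counters
 * packed into one 64-bit word, using the x86_64 widths quoted above.
 */
enum pgt_level { PGT_PTE, PGT_PMD, PGT_PUD };

static const unsigned pgt_bits[]  = { 27, 18, 9 };	/* 54 bits in total */
static const unsigned pgt_shift[] = { 0, 27, 45 };

/* Extract the counter for one level. */
static uint64_t pgt_read(uint64_t packed, enum pgt_level lvl)
{
	uint64_t mask = (UINT64_C(1) << pgt_bits[lvl]) - 1;

	return (packed >> pgt_shift[lvl]) & mask;
}

/* Return 'packed' with the counter for one level changed by delta. */
static uint64_t pgt_add(uint64_t packed, enum pgt_level lvl, int64_t delta)
{
	uint64_t mask = (UINT64_C(1) << pgt_bits[lvl]) - 1;
	uint64_t val = pgt_read(packed, lvl) + (uint64_t)delta;

	/* A field must never overflow or underflow into its neighbour. */
	assert(val <= mask);
	packed &= ~(mask << pgt_shift[lvl]);
	return packed | (val << pgt_shift[lvl]);
}
```

Calling `pgt_add(v, PGT_PMD, 1)` bumps only the PMD field, and `pgt_read(v, PGT_PTE)` recovers the PTE count on its own, which is what would let an exit-time check treat the levels differently. How this would map onto an `atomic_long_t` in the kernel is left open here.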
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f53.google.com (mail-la0-f53.google.com [209.85.215.53]) by kanga.kvack.org (Postfix) with ESMTP id 5022B6B0032 for ; Tue, 13 Jan 2015 16:43:58 -0500 (EST) Received: by mail-la0-f53.google.com with SMTP id gm9so5014899lab.12 for ; Tue, 13 Jan 2015 13:43:57 -0800 (PST) Received: from mail-lb0-x236.google.com (mail-lb0-x236.google.com. [2a00:1450:4010:c04::236]) by mx.google.com with ESMTPS id z4si4032203lae.103.2015.01.13.13.43.57 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 13 Jan 2015 13:43:57 -0800 (PST) Received: by mail-lb0-f182.google.com with SMTP id u10so4854940lbd.13 for ; Tue, 13 Jan 2015 13:43:57 -0800 (PST) Date: Wed, 14 Jan 2015 00:43:55 +0300 From: Cyrill Gorcunov Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150113214355.GC2253@moon> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Dave Hansen , Pavel Emelyanov , linux-kernel@vger.kernel.org On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: > We're going to account pmd page tables too. Let's rename mm->nr_pgtables > to something more generic. > > Signed-off-by: Kirill A. 
Shutemov > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > data << (PAGE_SHIFT-10), > mm->stack_vm << (PAGE_SHIFT-10), text, lib, > (PTRS_PER_PTE * sizeof(pte_t) * > - atomic_long_read(&mm->nr_ptes)) >> 10, > + atomic_long_read(&mm->nr_pgtables)) >> 10, This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) which might be true for all archs, right? Other looks good to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com [209.85.220.43]) by kanga.kvack.org (Postfix) with ESMTP id C6FF96B006C for ; Tue, 13 Jan 2015 16:49:15 -0500 (EST) Received: by mail-pa0-f43.google.com with SMTP id kx10so6093585pab.2 for ; Tue, 13 Jan 2015 13:49:15 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id n7si28231614pdj.247.2015.01.13.13.49.13 for ; Tue, 13 Jan 2015 13:49:14 -0800 (PST) Message-ID: <54B592D6.4090406@linux.intel.com> Date: Tue, 13 Jan 2015 13:49:10 -0800 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> In-Reply-To: <20150113214355.GC2253@moon> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Cyrill Gorcunov , "Kirill A. Shutemov" Cc: Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt On 01/13/2015 01:43 PM, Cyrill Gorcunov wrote: > On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. 
Shutemov wrote: >> We're going to account pmd page tables too. Let's rename mm->nr_pgtables >> to something more generic. >> >> Signed-off-by: Kirill A. Shutemov >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) >> data << (PAGE_SHIFT-10), >> mm->stack_vm << (PAGE_SHIFT-10), text, lib, >> (PTRS_PER_PTE * sizeof(pte_t) * >> - atomic_long_read(&mm->nr_ptes)) >> 10, >> + atomic_long_read(&mm->nr_pgtables)) >> 10, > > This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) > which might be true for all archs, right? I wonder if powerpc is OK on this front today. This diagram: http://linux-mm.org/PageTableStructure says that they use a 128-byte "pte" table when mapping 16M pages. I wonder if they bump mm->nr_ptes for these. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f41.google.com (mail-la0-f41.google.com [209.85.215.41]) by kanga.kvack.org (Postfix) with ESMTP id 2BE9E6B006C for ; Wed, 14 Jan 2015 04:45:41 -0500 (EST) Received: by mail-la0-f41.google.com with SMTP id hv19so7146574lab.0 for ; Wed, 14 Jan 2015 01:45:40 -0800 (PST) Received: from mail-lb0-x235.google.com (mail-lb0-x235.google.com. 
[2a00:1450:4010:c04::235]) by mx.google.com with ESMTPS id qz8si14069160lbc.116.2015.01.14.01.45.39 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 14 Jan 2015 01:45:40 -0800 (PST) Received: by mail-lb0-f181.google.com with SMTP id l4so6989234lbv.12 for ; Wed, 14 Jan 2015 01:45:39 -0800 (PST) Date: Wed, 14 Jan 2015 12:45:38 +0300 From: Cyrill Gorcunov Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150114094538.GD2253@moon> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> <54B592D6.4090406@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54B592D6.4090406@linux.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt On Tue, Jan 13, 2015 at 01:49:10PM -0800, Dave Hansen wrote: > On 01/13/2015 01:43 PM, Cyrill Gorcunov wrote: > > On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: > >> We're going to account pmd page tables too. Let's rename mm->nr_pgtables > >> to something more generic. > >> > >> Signed-off-by: Kirill A. Shutemov > >> --- a/fs/proc/task_mmu.c > >> +++ b/fs/proc/task_mmu.c > >> @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > >> data << (PAGE_SHIFT-10), > >> mm->stack_vm << (PAGE_SHIFT-10), text, lib, > >> (PTRS_PER_PTE * sizeof(pte_t) * > >> - atomic_long_read(&mm->nr_ptes)) >> 10, > >> + atomic_long_read(&mm->nr_pgtables)) >> 10, > > > > This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) > > which might be true for all archs, right? > > I wonder if powerpc is OK on this front today. 
This diagram: > > http://linux-mm.org/PageTableStructure > > says that they use a 128-byte "pte" table when mapping 16M pages. I > wonder if they bump mm->nr_ptes for these. It looks like this doesn't matter. The statistics here prints the size of summary memory occupied for pte_t entries, here PTRS_PER_PTE * sizeof(pte_t) is only valid for, once we start accounting pmd into same counter it implies that PTRS_PER_PTE == PTRS_PER_PMD, which is not true for all archs (if I understand the idea of accounting here right). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by kanga.kvack.org (Postfix) with ESMTP id 2F9D56B0032 for ; Wed, 14 Jan 2015 09:34:10 -0500 (EST) Received: by mail-we0-f182.google.com with SMTP id w62so9069456wes.13 for ; Wed, 14 Jan 2015 06:34:09 -0800 (PST) Received: from jenni2.inet.fi (mta-out1.inet.fi. [62.71.2.195]) by mx.google.com with ESMTP id ev8si26557428wib.27.2015.01.14.06.34.08 for ; Wed, 14 Jan 2015 06:34:09 -0800 (PST) Date: Wed, 14 Jan 2015 16:33:58 +0200 From: "Kirill A. Shutemov" Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150114143358.GA9820@node.dhcp.inet.fi> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> <54B592D6.4090406@linux.intel.com> <20150114094538.GD2253@moon> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150114094538.GD2253@moon> Sender: owner-linux-mm@kvack.org List-ID: To: Cyrill Gorcunov Cc: Dave Hansen , "Kirill A. 
Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt On Wed, Jan 14, 2015 at 12:45:38PM +0300, Cyrill Gorcunov wrote: > On Tue, Jan 13, 2015 at 01:49:10PM -0800, Dave Hansen wrote: > > On 01/13/2015 01:43 PM, Cyrill Gorcunov wrote: > > > On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: > > >> We're going to account pmd page tables too. Let's rename mm->nr_pgtables > > >> to something more generic. > > >> > > >> Signed-off-by: Kirill A. Shutemov > > >> --- a/fs/proc/task_mmu.c > > >> +++ b/fs/proc/task_mmu.c > > >> @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > > >> data << (PAGE_SHIFT-10), > > >> mm->stack_vm << (PAGE_SHIFT-10), text, lib, > > >> (PTRS_PER_PTE * sizeof(pte_t) * > > >> - atomic_long_read(&mm->nr_ptes)) >> 10, > > >> + atomic_long_read(&mm->nr_pgtables)) >> 10, > > > > > > This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) > > > which might be true for all archs, right? I doubt it. And even if it's true now, nobody can guarantee that this will be true for all future configurations. > > I wonder if powerpc is OK on this front today. This diagram: > > > > http://linux-mm.org/PageTableStructure > > > > says that they use a 128-byte "pte" table when mapping 16M pages. I > > wonder if they bump mm->nr_ptes for these. > > It looks like this doesn't matter. The statistics here prints the size > of summary memory occupied for pte_t entries, here PTRS_PER_PTE * sizeof(pte_t) > is only valid for, once we start accounting pmd into same counter it implies > that PTRS_PER_PTE == PTRS_PER_PMD, which is not true for all archs > (if I understand the idea of accounting here right). Yeah. good catch. Thank you. I'll respin with separate counter for pmd tables. It seems the best option. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by kanga.kvack.org (Postfix) with ESMTP id B175A6B0032 for ; Wed, 14 Jan 2015 09:48:47 -0500 (EST) Received: by mail-lb0-f182.google.com with SMTP id u10so8211538lbd.13 for ; Wed, 14 Jan 2015 06:48:45 -0800 (PST) Received: from mail-la0-x22d.google.com (mail-la0-x22d.google.com. [2a00:1450:4010:c03::22d]) by mx.google.com with ESMTPS id h3si6142781lam.29.2015.01.14.06.48.45 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 14 Jan 2015 06:48:45 -0800 (PST) Received: by mail-la0-f45.google.com with SMTP id gq15so8503395lab.4 for ; Wed, 14 Jan 2015 06:48:45 -0800 (PST) Date: Wed, 14 Jan 2015 17:48:43 +0300 From: Cyrill Gorcunov Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150114144843.GE2253@moon> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> <54B592D6.4090406@linux.intel.com> <20150114094538.GD2253@moon> <20150114143358.GA9820@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150114143358.GA9820@node.dhcp.inet.fi> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: Dave Hansen , "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt On Wed, Jan 14, 2015 at 04:33:58PM +0200, Kirill A. Shutemov wrote: > > > > It looks like this doesn't matter. 
The statistics here prints the size > > of summary memory occupied for pte_t entries, here PTRS_PER_PTE * sizeof(pte_t) > > is only valid for, once we start accounting pmd into same counter it implies > > that PTRS_PER_PTE == PTRS_PER_PMD, which is not true for all archs > > (if I understand the idea of accounting here right). > > Yeah. good catch. Thank you. > > I'll respin with separate counter for pmd tables. It seems the best > option. Sounds good to me, thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753503AbbAMTOd (ORCPT ); Tue, 13 Jan 2015 14:14:33 -0500 Received: from mga11.intel.com ([192.55.52.93]:19418 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752592AbbAMTOb (ORCPT ); Tue, 13 Jan 2015 14:14:31 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.07,750,1413270000"; d="scan'208";a="650565848" From: "Kirill A. Shutemov" To: Andrew Morton , Hugh Dickins Cc: linux-mm@kvack.org, Dave Hansen , Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH 0/2] Account PMD page tables to the process Date: Tue, 13 Jan 2015 21:14:14 +0200 Message-Id: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.1.4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently we don't account PMD page tables to the process. It can lead to local DoS: unprivileged user can allocate >500 MiB on x86_64 per process without being noticed by oom-killer or memory cgroup. Proposed fix adds accounting for PMD table the same way we account for PTE tables. There're few corner case in the accounting (see patch 2/2) which have not well tested yet. 
If anybody know any other cases we should handle, please let me know. Kirill A. Shutemov (2): mm: rename mm->nr_ptes to mm->nr_pgtables mm: account pmd page tables to the process Documentation/sysctl/vm.txt | 2 +- arch/x86/mm/pgtable.c | 13 ++++++++----- fs/proc/task_mmu.c | 2 +- include/linux/mm_types.h | 2 +- kernel/fork.c | 2 +- mm/debug.c | 4 ++-- mm/huge_memory.c | 10 +++++----- mm/hugetlb.c | 8 ++++++-- mm/memory.c | 6 ++++-- mm/mmap.c | 9 +++++++-- mm/oom_kill.c | 8 ++++---- 11 files changed, 40 insertions(+), 26 deletions(-) -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753793AbbAMTPs (ORCPT ); Tue, 13 Jan 2015 14:15:48 -0500 Received: from mga14.intel.com ([192.55.52.115]:61666 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752831AbbAMTOc (ORCPT ); Tue, 13 Jan 2015 14:14:32 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.07,750,1413270000"; d="scan'208";a="669271967" From: "Kirill A. Shutemov" To: Andrew Morton , Hugh Dickins Cc: linux-mm@kvack.org, Dave Hansen , Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Date: Tue, 13 Jan 2015 21:14:15 +0200 Message-Id: <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.1.4 In-Reply-To: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org We're going to account pmd page tables too. Let's rename mm->nr_pgtables to something more generic. Signed-off-by: Kirill A. 
Shutemov --- Documentation/sysctl/vm.txt | 2 +- fs/proc/task_mmu.c | 2 +- include/linux/mm_types.h | 2 +- kernel/fork.c | 2 +- mm/debug.c | 4 ++-- mm/huge_memory.c | 10 +++++----- mm/memory.c | 4 ++-- mm/mmap.c | 2 +- mm/oom_kill.c | 8 ++++---- 9 files changed, 18 insertions(+), 18 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 4415aa915681..0211a58ee6c3 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -557,7 +557,7 @@ oom_dump_tasks Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing and includes such -information as pid, uid, tgid, vm size, rss, nr_ptes, swapents, +information as pid, uid, tgid, vm size, rss, nr_pgtables, swapents, oom_score_adj score, and name. This is helpful to determine why the OOM killer was invoked, to identify the rogue task that caused it, and to determine why the OOM killer chose the task it did to kill. diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 8faae6fed085..6121cca220cb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) data << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, (PTRS_PER_PTE * sizeof(pte_t) * - atomic_long_read(&mm->nr_ptes)) >> 10, + atomic_long_read(&mm->nr_pgtables)) >> 10, swap << (PAGE_SHIFT-10)); } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 20ff2105b564..106be259a3ea 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -363,7 +363,7 @@ struct mm_struct { pgd_t * pgd; atomic_t mm_users; /* How many users with user space? 
*/ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ - atomic_long_t nr_ptes; /* Page table pages */ + atomic_long_t nr_pgtables; /* Page table pages */ int map_count; /* number of VMAs */ spinlock_t page_table_lock; /* Protects page tables and some counters */ diff --git a/kernel/fork.c b/kernel/fork.c index b379d9abddc7..d7d00cac1f66 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -554,7 +554,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p) init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; - atomic_long_set(&mm->nr_ptes, 0); + atomic_long_set(&mm->nr_pgtables, 0); mm->map_count = 0; mm->locked_vm = 0; mm->pinned_vm = 0; diff --git a/mm/debug.c b/mm/debug.c index d69cb5a7ba9a..229d3cba5677 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -173,7 +173,7 @@ void dump_mm(const struct mm_struct *mm) "get_unmapped_area %p\n" #endif "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" - "pgd %p mm_users %d mm_count %d nr_ptes %lu map_count %d\n" + "pgd %p mm_users %d mm_count %d nr_pgtables %lu map_count %d\n" "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" "pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n" "start_code %lx end_code %lx start_data %lx end_data %lx\n" @@ -205,7 +205,7 @@ void dump_mm(const struct mm_struct *mm) mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end, mm->pgd, atomic_read(&mm->mm_users), atomic_read(&mm->mm_count), - atomic_long_read((atomic_long_t *)&mm->nr_ptes), + atomic_long_read((atomic_long_t *)&mm->nr_pgtables), mm->map_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b29c48708b89..6ad66bc4da88 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -749,7 +749,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm, pgtable_trans_huge_deposit(mm, pmd, pgtable); 
set_pmd_at(mm, haddr, pmd, entry); add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR); - atomic_long_inc(&mm->nr_ptes); + atomic_long_inc(&mm->nr_pgtables); spin_unlock(ptl); } @@ -782,7 +782,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm, entry = pmd_mkhuge(entry); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, haddr, pmd, entry); - atomic_long_inc(&mm->nr_ptes); + atomic_long_inc(&mm->nr_pgtables); return true; } @@ -905,7 +905,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = pmd_mkold(pmd_wrprotect(pmd)); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); set_pmd_at(dst_mm, addr, dst_pmd, pmd); - atomic_long_inc(&dst_mm->nr_ptes); + atomic_long_inc(&dst_mm->nr_pgtables); ret = 0; out_unlock: @@ -1429,7 +1429,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, tlb_remove_pmd_tlb_entry(tlb, pmd, addr); pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd); if (is_huge_zero_pmd(orig_pmd)) { - atomic_long_dec(&tlb->mm->nr_ptes); + atomic_long_dec(&tlb->mm->nr_pgtables); spin_unlock(ptl); put_huge_zero_page(); } else { @@ -1438,7 +1438,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); VM_BUG_ON_PAGE(!PageHead(page), page); - atomic_long_dec(&tlb->mm->nr_ptes); + atomic_long_dec(&tlb->mm->nr_pgtables); spin_unlock(ptl); tlb_remove_page(tlb, page); } diff --git a/mm/memory.c b/mm/memory.c index 5afb6d89ac96..2f9ee3089c20 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -394,7 +394,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, pgtable_t token = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free_tlb(tlb, token, addr); - atomic_long_dec(&tlb->mm->nr_ptes); + atomic_long_dec(&tlb->mm->nr_pgtables); } static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, @@ -586,7 +586,7 @@ int __pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma, ptl = pmd_lock(mm, pmd); wait_split_huge_page = 0; if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ - atomic_long_inc(&mm->nr_ptes); + atomic_long_inc(&mm->nr_pgtables); pmd_populate(mm, pmd, new); new = NULL; } else if (unlikely(pmd_trans_splitting(*pmd))) diff --git a/mm/mmap.c b/mm/mmap.c index 14d84666e8ba..3c591112263d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2852,7 +2852,7 @@ void exit_mmap(struct mm_struct *mm) } vm_unacct_memory(nr_accounted); - WARN_ON(atomic_long_read(&mm->nr_ptes) > + WARN_ON(atomic_long_read(&mm->nr_pgtables) > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 294493a7ae4b..d121ee7fa357 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -169,7 +169,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ - points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) + + points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_pgtables) + get_mm_counter(p->mm, MM_SWAPENTS); task_unlock(p); @@ -345,7 +345,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, * Dumps the current memory state of all eligible tasks. Tasks not in the same * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes * are not shown. - * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes, + * State information includes task's pid, uid, tgid, vm size, rss, nr_pgtables, * swapents, oom_score_adj value, and name. 
*/ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) @@ -353,7 +353,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) struct task_struct *p; struct task_struct *task; - pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss nr_pgtables swapents oom_score_adj name\n"); rcu_read_lock(); for_each_process(p) { if (oom_unkillable_task(p, memcg, nodemask)) @@ -372,7 +372,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) pr_info("[%5d] %5d %5d %8lu %8lu %7ld %8lu %5hd %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - atomic_long_read(&task->mm->nr_ptes), + atomic_long_read(&task->mm->nr_pgtables), get_mm_counter(task->mm, MM_SWAPENTS), task->signal->oom_score_adj, task->comm); task_unlock(task); -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753741AbbAMTPF (ORCPT ); Tue, 13 Jan 2015 14:15:05 -0500 Received: from mga03.intel.com ([134.134.136.65]:20143 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753711AbbAMTPC (ORCPT ); Tue, 13 Jan 2015 14:15:02 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.07,750,1413270000"; d="scan'208";a="636767319" From: "Kirill A. Shutemov" To: Andrew Morton , Hugh Dickins Cc: linux-mm@kvack.org, Dave Hansen , Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov"
Subject: [PATCH 2/2] mm: account pmd page tables to the process
Date: Tue, 13 Jan 2015 21:14:16 +0200
Message-Id: <1421176456-21796-3-git-send-email-kirill.shutemov@linux.intel.com>
X-Mailer: git-send-email 2.1.4
In-Reply-To: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Dave noticed that an unprivileged process can allocate a significant amount
of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and
the memory cgroup. The trick is to allocate a lot of PMD page tables. The
Linux kernel doesn't account PMD tables to the process, only PTE tables.

The test case below uses a few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		/* Disable THP so the write below faults in a single small page */
		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			/* Touch the range so that its PMD and PTE tables get allocated */
			*addr = 'x';
			/* Drop the touched page and its PTE table; the PMD table stays */
			munmap(addr, PMD_SIZE);
			if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by accounting PMD tables to the process the
same way we account PTE tables.

The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:

 - HugeTLB can share PMD page tables. The patch handles this by accounting
   the table to all processes that share it.

 - x86 PAE pre-allocates a few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0.
We need to adjust sanity check on exit(2). Signed-off-by: Kirill A. Shutemov Reported-by: Dave Hansen --- arch/x86/mm/pgtable.c | 13 ++++++++----- mm/hugetlb.c | 8 ++++++-- mm/memory.c | 2 ++ mm/mmap.c | 9 +++++++-- 4 files changed, 23 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 6fb6927f9e76..7f7b7005a7d5 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -190,7 +190,7 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd) #endif /* CONFIG_X86_PAE */ -static void free_pmds(pmd_t *pmds[]) +static void free_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; @@ -198,10 +198,11 @@ static void free_pmds(pmd_t *pmds[]) if (pmds[i]) { pgtable_pmd_page_dtor(virt_to_page(pmds[i])); free_page((unsigned long)pmds[i]); + atomic_long_dec(&mm->nr_pgtables); } } -static int preallocate_pmds(pmd_t *pmds[]) +static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[]) { int i; bool failed = false; @@ -215,11 +216,13 @@ static int preallocate_pmds(pmd_t *pmds[]) pmd = NULL; failed = true; } + if (pmd) + atomic_long_inc(&mm->nr_pgtables); pmds[i] = pmd; } if (failed) { - free_pmds(pmds); + free_pmds(mm, pmds); return -ENOMEM; } @@ -283,7 +286,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) mm->pgd = pgd; - if (preallocate_pmds(pmds) != 0) + if (preallocate_pmds(mm, pmds) != 0) goto out_free_pgd; if (paravirt_pgd_alloc(mm) != 0) @@ -304,7 +307,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm) return pgd; out_free_pmds: - free_pmds(pmds); + free_pmds(mm, pmds); out_free_pgd: free_page((unsigned long)pgd); out: diff --git a/mm/hugetlb.c b/mm/hugetlb.c index be0e5d0db5ec..edd1278b4965 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3558,6 +3558,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (saddr) { spte = huge_pte_offset(svma->vm_mm, saddr); if (spte) { + atomic_long_inc(&mm->nr_pgtables); get_page(virt_to_page(spte)); break; } @@ -3569,11 +3570,13 @@ pte_t 
*huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) ptl = huge_pte_lockptr(hstate_vma(vma), mm, spte); spin_lock(ptl); - if (pud_none(*pud)) + if (pud_none(*pud)) { pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK)); - else + } else { put_page(virt_to_page(spte)); + atomic_long_dec(&mm->nr_pgtables); + } spin_unlock(ptl); out: pte = (pte_t *)pmd_alloc(mm, pud, addr); @@ -3604,6 +3607,7 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) pud_clear(pud); put_page(virt_to_page(ptep)); + atomic_long_dec(&mm->nr_pgtables); *addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE; return 1; } diff --git a/mm/memory.c b/mm/memory.c index 2f9ee3089c20..8b6f32e1c0b5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -428,6 +428,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, pmd = pmd_offset(pud, start); pud_clear(pud); pmd_free_tlb(tlb, pmd, start); + atomic_long_dec(&tlb->mm->nr_pgtables); } static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd, @@ -3321,6 +3322,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) smp_wmb(); /* See comment in __pte_alloc */ spin_lock(&mm->page_table_lock); + atomic_long_inc(&mm->nr_pgtables); #ifndef __ARCH_HAS_4LEVEL_HACK if (pud_present(*pud)) /* Another has populated it */ pmd_free(mm, new); diff --git a/mm/mmap.c b/mm/mmap.c index 3c591112263d..d8013a6b7ebd 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2812,6 +2812,7 @@ void exit_mmap(struct mm_struct *mm) struct mmu_gather tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; + unsigned long max_nr_pgtables; /* mm's last user has gone, and its about to be pulled down */ mmu_notifier_release(mm); @@ -2852,8 +2853,12 @@ void exit_mmap(struct mm_struct *mm) } vm_unacct_memory(nr_accounted); - WARN_ON(atomic_long_read(&mm->nr_pgtables) > - (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); + max_nr_pgtables = round_up(FIRST_USER_ADDRESS, PMD_SIZE) >> PMD_SHIFT; 
+ if (!IS_ENABLED(__PAGETABLE_PMD_FOLDED)) { + max_nr_pgtables += + round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT; + } + WARN_ON(atomic_long_read(&mm->nr_pgtables) > max_nr_pgtables); } /* Insert vm structure into process list sorted by address -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752703AbbAMUgc (ORCPT ); Tue, 13 Jan 2015 15:36:32 -0500 Received: from mga14.intel.com ([192.55.52.115]:28810 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751175AbbAMUgb (ORCPT ); Tue, 13 Jan 2015 15:36:31 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.04,691,1406617200"; d="scan'208";a="511748605" Message-ID: <54B581C7.50206@linux.intel.com> Date: Tue, 13 Jan 2015 12:36:23 -0800 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins CC: linux-mm@kvack.org, Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> In-Reply-To: <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/13/2015 11:14 AM, Kirill A. Shutemov wrote: > pgd_t * pgd; > atomic_t mm_users; /* How many users with user space? */ > atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ > - atomic_long_t nr_ptes; /* Page table pages */ > + atomic_long_t nr_pgtables; /* Page table pages */ > int map_count; /* number of VMAs */ One more crazy idea... There are 2^9 possible pud pages, 2^18 pmd pages and 2^27 pte pages. 
That's only 54 bits (technically minus one bit each because the upper half of the address space is for the kernel). That's enough to actually account for pte, pmd and pud pages separately without increasing the size of the storage we need. You could even enforce that warning you were talking about at exit time for pte pages, but just ignore pmd mismatches so you don't have false warnings on hugetlbfs shared pmd pages. Or, even better, strictly track pmd page usage _unless_ hugetlbfs shared pmds are in play and track _that_ in another bit. On 32-bit PAE, that's 2 bits for PMD pages, and 11 for PTE pages, so it should fit in an atomic_long_t there too. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753388AbbAMUmE (ORCPT ); Tue, 13 Jan 2015 15:42:04 -0500 Received: from mta-out1.inet.fi ([62.71.2.227]:38897 "EHLO jenni2.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752291AbbAMUmB (ORCPT ); Tue, 13 Jan 2015 15:42:01 -0500 Date: Tue, 13 Jan 2015 22:41:44 +0200 From: "Kirill A. Shutemov" To: Dave Hansen Cc: "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150113204144.GA1865@node.dhcp.inet.fi> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <54B581C7.50206@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54B581C7.50206@linux.intel.com> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 13, 2015 at 12:36:23PM -0800, Dave Hansen wrote: > On 01/13/2015 11:14 AM, Kirill A. 
Shutemov wrote: > > pgd_t * pgd; > > atomic_t mm_users; /* How many users with user space? */ > > atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ > > - atomic_long_t nr_ptes; /* Page table pages */ > > + atomic_long_t nr_pgtables; /* Page table pages */ > > int map_count; /* number of VMAs */ > > One more crazy idea... > > There are 2^9 possible pud pages, 2^18 pmd pages and 2^27 pte pages. > That's only 54 bits (technically minus one bit each because the upper > half of the address space is for the kernel). Does this math make sense for all architectures? IA64? Power? -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752446AbbAMVfZ (ORCPT ); Tue, 13 Jan 2015 16:35:25 -0500 Received: from mga01.intel.com ([192.55.52.88]:45061 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751400AbbAMVfY (ORCPT ); Tue, 13 Jan 2015 16:35:24 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.04,691,1406617200"; d="scan'208";a="511774541" Message-ID: <54B58F9B.4050100@linux.intel.com> Date: Tue, 13 Jan 2015 13:35:23 -0800 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: "Kirill A. Shutemov" CC: "Kirill A. 
Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Cyrill Gorcunov , Pavel Emelyanov , linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <54B581C7.50206@linux.intel.com> <20150113204144.GA1865@node.dhcp.inet.fi> In-Reply-To: <20150113204144.GA1865@node.dhcp.inet.fi> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/13/2015 12:41 PM, Kirill A. Shutemov wrote: > On Tue, Jan 13, 2015 at 12:36:23PM -0800, Dave Hansen wrote: >> On 01/13/2015 11:14 AM, Kirill A. Shutemov wrote: >>> pgd_t * pgd; >>> atomic_t mm_users; /* How many users with user space? */ >>> atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ >>> - atomic_long_t nr_ptes; /* Page table pages */ >>> + atomic_long_t nr_pgtables; /* Page table pages */ >>> int map_count; /* number of VMAs */ >> >> One more crazy idea... >> >> There are 2^9 possible pud pages, 2^18 pmd pages and 2^27 pte pages. >> That's only 54 bits (technically minus one bit each because the upper >> half of the address space is for the kernel). > > Does this math make sense for all architectures? IA64? Power? No, the sizes will be different on the other architectures. But, 4k pages with 64-bit ptes is as bad as it gets, I think. Larger page sizes mean fewer page tables on powerpc. So the values should at least _fit_ in a long. Maybe it's not even worth the trouble. 
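The packing idea discussed above can be sketched in plain C. This is only an illustration under stated assumptions, not code from the patch set: the field widths (27, 18 and 9 bits) come from the x86_64 numbers quoted in the message, while every macro and function name below is hypothetical. A userspace uint64_t stands in for the atomic_long_t in struct mm_struct.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical field layout based on the x86_64 numbers in the thread:
 * 27 bits for PTE pages, 18 bits for PMD pages, 9 bits for PUD pages,
 * 54 bits in total. None of these names exist in the kernel.
 */
#define NR_PTE_BITS	27
#define NR_PMD_BITS	18
#define NR_PUD_BITS	9

#define NR_PTE_SHIFT	0
#define NR_PMD_SHIFT	(NR_PTE_SHIFT + NR_PTE_BITS)
#define NR_PUD_SHIFT	(NR_PMD_SHIFT + NR_PMD_BITS)

#define FIELD_MASK(bits)	((UINT64_C(1) << (bits)) - 1)

/* Read the counter stored in one field of the packed word. */
static inline uint64_t pgtables_get(uint64_t packed, unsigned int shift,
				    unsigned int bits)
{
	return (packed >> shift) & FIELD_MASK(bits);
}

/* Account one more page table page in the field starting at 'shift'. */
static inline uint64_t pgtables_inc(uint64_t packed, unsigned int shift,
				    unsigned int bits)
{
	/* An overflow would carry into the neighbouring field. */
	assert(pgtables_get(packed, shift, bits) < FIELD_MASK(bits));
	return packed + (UINT64_C(1) << shift);
}

/* Account one page table page less in the field starting at 'shift'. */
static inline uint64_t pgtables_dec(uint64_t packed, unsigned int shift,
				    unsigned int bits)
{
	/* An underflow would borrow from the neighbouring field. */
	assert(pgtables_get(packed, shift, bits) > 0);
	return packed - (UINT64_C(1) << shift);
}
```

Each update is a single addition or subtraction of a shifted one, so the scheme would presumably map onto one atomic add on the existing counter, provided no field ever overflows into its neighbour. That requirement is exactly the per-architecture question raised in the reply above.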
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753264AbbAMVoA (ORCPT ); Tue, 13 Jan 2015 16:44:00 -0500 Received: from mail-la0-f41.google.com ([209.85.215.41]:52698 "EHLO mail-la0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752729AbbAMVn6 (ORCPT ); Tue, 13 Jan 2015 16:43:58 -0500 Date: Wed, 14 Jan 2015 00:43:55 +0300 From: Cyrill Gorcunov To: "Kirill A. Shutemov" Cc: Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Dave Hansen , Pavel Emelyanov , linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150113214355.GC2253@moon> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: > We're going to account pmd page tables too. Let's rename mm->nr_pgtables > to something more generic. > > Signed-off-by: Kirill A. Shutemov > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > data << (PAGE_SHIFT-10), > mm->stack_vm << (PAGE_SHIFT-10), text, lib, > (PTRS_PER_PTE * sizeof(pte_t) * > - atomic_long_read(&mm->nr_ptes)) >> 10, > + atomic_long_read(&mm->nr_pgtables)) >> 10, This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) which might be true for all archs, right? Other looks good to me. 
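The size assumption questioned above can be made concrete with a small userspace sketch. Everything here is hypothetical: the struct, the function names and the geometry values are made up for illustration and do not describe any real architecture. The first function mirrors the shape of the task_mem() expression once PMD tables are added to the same counter, the second scales each level by its own table size.

```c
#include <stddef.h>

/* Hypothetical per-level table geometry (entries per table, entry size). */
struct pgtable_geometry {
	size_t ptrs_per_pte;
	size_t pte_entry_size;
	size_t ptrs_per_pmd;
	size_t pmd_entry_size;
};

/*
 * One shared counter: every accounted table, PTE or PMD, is assumed to be
 * as large as a PTE table. Returns the reported size in KiB.
 */
static inline size_t kib_single_counter(const struct pgtable_geometry *g,
					size_t nr_ptes, size_t nr_pmds)
{
	return (g->ptrs_per_pte * g->pte_entry_size * (nr_ptes + nr_pmds)) >> 10;
}

/* Separate counters: each level is multiplied by its own table size. */
static inline size_t kib_separate_counters(const struct pgtable_geometry *g,
					   size_t nr_ptes, size_t nr_pmds)
{
	return (g->ptrs_per_pte * g->pte_entry_size * nr_ptes +
		g->ptrs_per_pmd * g->pmd_entry_size * nr_pmds) >> 10;
}
```

With equal table sizes at both levels the two functions agree. As soon as a PMD table is larger or smaller than a PTE table, the single-counter figure is off by the difference times the number of PMD tables.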
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753110AbbAMVtP (ORCPT ); Tue, 13 Jan 2015 16:49:15 -0500 Received: from mga03.intel.com ([134.134.136.65]:39922 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751566AbbAMVtN (ORCPT ); Tue, 13 Jan 2015 16:49:13 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.07,751,1413270000"; d="scan'208";a="669349482" Message-ID: <54B592D6.4090406@linux.intel.com> Date: Tue, 13 Jan 2015 13:49:10 -0800 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Cyrill Gorcunov , "Kirill A. Shutemov" CC: Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> In-Reply-To: <20150113214355.GC2253@moon> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/13/2015 01:43 PM, Cyrill Gorcunov wrote: > On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: >> We're going to account pmd page tables too. Let's rename mm->nr_pgtables >> to something more generic. >> >> Signed-off-by: Kirill A. 
Shutemov >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) >> data << (PAGE_SHIFT-10), >> mm->stack_vm << (PAGE_SHIFT-10), text, lib, >> (PTRS_PER_PTE * sizeof(pte_t) * >> - atomic_long_read(&mm->nr_ptes)) >> 10, >> + atomic_long_read(&mm->nr_pgtables)) >> 10, > > This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) > which might be true for all archs, right? I wonder if powerpc is OK on this front today. This diagram: http://linux-mm.org/PageTableStructure says that they use a 128-byte "pte" table when mapping 16M pages. I wonder if they bump mm->nr_ptes for these. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753584AbbANJpp (ORCPT ); Wed, 14 Jan 2015 04:45:45 -0500 Received: from mail-lb0-f171.google.com ([209.85.217.171]:50165 "EHLO mail-lb0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753073AbbANJpl (ORCPT ); Wed, 14 Jan 2015 04:45:41 -0500 Date: Wed, 14 Jan 2015 12:45:38 +0300 From: Cyrill Gorcunov To: Dave Hansen Cc: "Kirill A. 
Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150114094538.GD2253@moon> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> <54B592D6.4090406@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54B592D6.4090406@linux.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 13, 2015 at 01:49:10PM -0800, Dave Hansen wrote: > On 01/13/2015 01:43 PM, Cyrill Gorcunov wrote: > > On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: > >> We're going to account pmd page tables too. Let's rename mm->nr_pgtables > >> to something more generic. > >> > >> Signed-off-by: Kirill A. Shutemov > >> --- a/fs/proc/task_mmu.c > >> +++ b/fs/proc/task_mmu.c > >> @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > >> data << (PAGE_SHIFT-10), > >> mm->stack_vm << (PAGE_SHIFT-10), text, lib, > >> (PTRS_PER_PTE * sizeof(pte_t) * > >> - atomic_long_read(&mm->nr_ptes)) >> 10, > >> + atomic_long_read(&mm->nr_pgtables)) >> 10, > > > > This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) > > which might be true for all archs, right? > > I wonder if powerpc is OK on this front today. This diagram: > > http://linux-mm.org/PageTableStructure > > says that they use a 128-byte "pte" table when mapping 16M pages. I > wonder if they bump mm->nr_ptes for these. It looks like this doesn't matter. 
The statistic here prints the total memory occupied by pte_t entries, and PTRS_PER_PTE * sizeof(pte_t) is valid only for PTE tables. Once we start accounting pmd tables in the same counter, it implies that PTRS_PER_PTE == PTRS_PER_PMD, which is not true for all archs (if I understand the idea of the accounting here right). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752601AbbANOfJ (ORCPT ); Wed, 14 Jan 2015 09:35:09 -0500 Received: from mta-out1.inet.fi ([62.71.2.227]:38265 "EHLO jenni2.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751272AbbANOfH (ORCPT ); Wed, 14 Jan 2015 09:35:07 -0500 Date: Wed, 14 Jan 2015 16:33:58 +0200 From: "Kirill A. Shutemov" To: Cyrill Gorcunov Cc: Dave Hansen , "Kirill A. Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150114143358.GA9820@node.dhcp.inet.fi> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> <54B592D6.4090406@linux.intel.com> <20150114094538.GD2253@moon> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150114094538.GD2253@moon> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 14, 2015 at 12:45:38PM +0300, Cyrill Gorcunov wrote: > On Tue, Jan 13, 2015 at 01:49:10PM -0800, Dave Hansen wrote: > > On 01/13/2015 01:43 PM, Cyrill Gorcunov wrote: > > > On Tue, Jan 13, 2015 at 09:14:15PM +0200, Kirill A. Shutemov wrote: > > >> We're going to account pmd page tables too. Let's rename mm->nr_pgtables > > >> to something more generic. > > >> > > >> Signed-off-by: Kirill A.
Shutemov > > >> --- a/fs/proc/task_mmu.c > > >> +++ b/fs/proc/task_mmu.c > > >> @@ -64,7 +64,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > > >> data << (PAGE_SHIFT-10), > > >> mm->stack_vm << (PAGE_SHIFT-10), text, lib, > > >> (PTRS_PER_PTE * sizeof(pte_t) * > > >> - atomic_long_read(&mm->nr_ptes)) >> 10, > > >> + atomic_long_read(&mm->nr_pgtables)) >> 10, > > > > > > This implies that (PTRS_PER_PTE * sizeof(pte_t)) = (PTRS_PER_PMD * sizeof(pmd_t)) > > > which might be true for all archs, right? I doubt it. And even if it's true now, nobody can guarantee that this will be true for all future configurations. > > I wonder if powerpc is OK on this front today. This diagram: > > > > http://linux-mm.org/PageTableStructure > > > > says that they use a 128-byte "pte" table when mapping 16M pages. I > > wonder if they bump mm->nr_ptes for these. > > It looks like this doesn't matter. The statistics here prints the size > of summary memory occupied for pte_t entries, here PTRS_PER_PTE * sizeof(pte_t) > is only valid for, once we start accounting pmd into same counter it implies > that PTRS_PER_PTE == PTRS_PER_PMD, which is not true for all archs > (if I understand the idea of accounting here right). Yeah. good catch. Thank you. I'll respin with separate counter for pmd tables. It seems the best option. -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753436AbbANOss (ORCPT ); Wed, 14 Jan 2015 09:48:48 -0500 Received: from mail-la0-f42.google.com ([209.85.215.42]:57031 "EHLO mail-la0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751999AbbANOsq (ORCPT ); Wed, 14 Jan 2015 09:48:46 -0500 Date: Wed, 14 Jan 2015 17:48:43 +0300 From: Cyrill Gorcunov To: "Kirill A. Shutemov" Cc: Dave Hansen , "Kirill A. 
Shutemov" , Andrew Morton , Hugh Dickins , linux-mm@kvack.org, Pavel Emelyanov , linux-kernel@vger.kernel.org, Benjamin Herrenschmidt Subject: Re: [PATCH 1/2] mm: rename mm->nr_ptes to mm->nr_pgtables Message-ID: <20150114144843.GE2253@moon> References: <1421176456-21796-1-git-send-email-kirill.shutemov@linux.intel.com> <1421176456-21796-2-git-send-email-kirill.shutemov@linux.intel.com> <20150113214355.GC2253@moon> <54B592D6.4090406@linux.intel.com> <20150114094538.GD2253@moon> <20150114143358.GA9820@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150114143358.GA9820@node.dhcp.inet.fi> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 14, 2015 at 04:33:58PM +0200, Kirill A. Shutemov wrote: > > > > It looks like this doesn't matter. The statistics here prints the size > > of summary memory occupied for pte_t entries, here PTRS_PER_PTE * sizeof(pte_t) > > is only valid for, once we start accounting pmd into same counter it implies > > that PTRS_PER_PTE == PTRS_PER_PMD, which is not true for all archs > > (if I understand the idea of accounting here right). > > Yeah. good catch. Thank you. > > I'll respin with separate counter for pmd tables. It seems the best > option. Sounds good to me, thanks.
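
The conclusion of the thread (a separate counter for pmd tables) can be sketched in user space as follows. This is a minimal illustration, not the respun patch: struct pgtable_geometry, struct pgtable_counts and the three helpers are hypothetical names, and the fields stand in for PTRS_PER_PTE, sizeof(pte_t), PTRS_PER_PMD and sizeof(pmd_t) on some architecture.

```c
#include <stddef.h>

/* Hypothetical per-arch table geometry; not taken from any real header. */
struct pgtable_geometry {
    size_t ptrs_per_pte;   /* stands in for PTRS_PER_PTE */
    size_t pte_entry_size; /* stands in for sizeof(pte_t) */
    size_t ptrs_per_pmd;   /* stands in for PTRS_PER_PMD */
    size_t pmd_entry_size; /* stands in for sizeof(pmd_t) */
};

/* Hypothetical stand-in for two separate counters in struct mm_struct. */
struct pgtable_counts {
    unsigned long nr_ptes;
    unsigned long nr_pmds;
};

/* KiB used by PTE tables: same formula as the task_mem() hunk quoted above. */
static unsigned long pte_tables_kb(const struct pgtable_geometry *g,
                                   const struct pgtable_counts *c)
{
    return (g->ptrs_per_pte * g->pte_entry_size * c->nr_ptes) >> 10;
}

/* KiB used by PMD tables, computed from the PMD geometry. */
static unsigned long pmd_tables_kb(const struct pgtable_geometry *g,
                                   const struct pgtable_counts *c)
{
    return (g->ptrs_per_pmd * g->pmd_entry_size * c->nr_pmds) >> 10;
}

/* What a single shared counter would report: every table sized as a PTE table. */
static unsigned long single_counter_kb(const struct pgtable_geometry *g,
                                       const struct pgtable_counts *c)
{
    return (g->ptrs_per_pte * g->pte_entry_size *
            (c->nr_ptes + c->nr_pmds)) >> 10;
}
```

On x86_64 with 4 KiB pages both table types have 512 entries of 8 bytes, so single_counter_kb() agrees with the sum of the two separate results. With a made-up geometry in which the PMD table is larger than the PTE table, the shared counter under-reports, which is the problem raised in the review.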