From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ebru Akagunduz
Subject: [RFC 0/3] mm: make swapin readahead to gain more thp performance
Date: Sun, 14 Jun 2015 18:04:40 +0300
Message-Id: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com,
	vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com, Ebru Akagunduz

This patch series makes swapin readahead up to a
certain number to gain more thp performance and adds
tracepoint for khugepaged_scan_pmd, collapse_huge_page,
__collapse_huge_page_isolate.

This patch series was written to deal with programs
that access most, but not all, of their memory after
they get swapped out. Currently these programs do not
get their memory collapsed into THPs after the system
swapped their memory out, while they would get THPs
before swapping happened.

This patch series was tested with a test program,
it allocates 800MB of memory, writes to it, and
then sleeps.
I force the system to swap out all.
Afterwards, the test program touches the area by
writing and leaves a piece of it without writing.
This shows how much swap in readahead made by the
patch.

I've written down test results:

With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:        470760 kB
AnonHugePages:    468992 kB
Swap:             329244 kB
Fraction:         %99

After swapped in:
In ten minutes:
cat /proc/pid/smaps:
Anonymous:        769208 kB
AnonHugePages:    765952 kB
Swap:              30796 kB
Fraction:         %99

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:        238160 kB
AnonHugePages:    235520 kB
Swap:             561844 kB
Fraction:         %98

After swapped in:
cat /proc/pid/smaps:
In ten minutes:
Anonymous:        499956 kB
AnonHugePages:    235520 kB
Swap:             300048 kB
Fraction:         %47

Ebru Akagunduz (3):
  mm: add tracepoint for scanning pages
  mm: make optimistic check for swapin readahead
  mm: make swapin readahead to improve thp collapse rate

 include/linux/mm.h                 |   4 ++
 include/trace/events/huge_memory.h | 123 +++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                   |  56 ++++++++++++++++-
 mm/memory.c                        |   2 +-
 4 files changed, 181 insertions(+), 4 deletions(-)
 create mode 100644 include/trace/events/huge_memory.h

--
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ebru Akagunduz
Subject: [RFC 1/3] mm: add tracepoint for scanning pages
Date: Sun, 14 Jun 2015 18:04:41 +0300
Message-Id: <1434294283-8699-2-git-send-email-ebru.akagunduz@gmail.com>
In-Reply-To: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com,
	vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com, Ebru Akagunduz

Using static tracepoints, data of functions is recorded.
It is good to automatize debugging without doing a lot
of changes in the source code.

This patch adds tracepoint for khugepaged_scan_pmd,
collapse_huge_page and __collapse_huge_page_isolate.

Signed-off-by: Ebru Akagunduz
---
 include/trace/events/huge_memory.h | 96 ++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                   | 10 +++-
 2 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/huge_memory.h

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
new file mode 100644
index 0000000..4b9049b
--- /dev/null
+++ b/include/trace/events/huge_memory.h
@@ -0,0 +1,96 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM huge_memory
+
+#if !defined(__HUGE_MEMORY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HUGE_MEMORY_H
+
+#include
+
+TRACE_EVENT(mm_khugepaged_scan_pmd,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
+		bool referenced, int none_or_zero, int collapse),
+
+	TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, vm_start)
+		__field(bool, writable)
+		__field(bool, referenced)
+		__field(int, none_or_zero)
+		__field(int, collapse)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->vm_start = vm_start;
+		__entry->writable = writable;
+		__entry->referenced = referenced;
+		__entry->none_or_zero = none_or_zero;
+		__entry->collapse = collapse;
+	),
+
+	TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
+		__entry->mm,
+		__entry->vm_start,
+		__entry->writable,
+		__entry->referenced,
+		__entry->none_or_zero,
+		__entry->collapse)
+);
+
+TRACE_EVENT(mm_collapse_huge_page,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int isolated),
+
+	TP_ARGS(mm, vm_start, isolated),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, vm_start)
+		__field(int, isolated)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->vm_start = vm_start;
+		__entry->isolated = isolated;
+	),
+
+	TP_printk("mm=%p, vm_start=%04lx, isolated=%d",
+		__entry->mm,
+		__entry->vm_start,
+		__entry->isolated)
+);
+
+TRACE_EVENT(mm_collapse_huge_page_isolate,
+
+	TP_PROTO(unsigned long vm_start, int none_or_zero,
+		bool referenced, bool writable),
+
+	TP_ARGS(vm_start, none_or_zero, referenced, writable),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, vm_start)
+		__field(int, none_or_zero)
+		__field(bool, referenced)
+		__field(bool, writable)
+	),
+
+	TP_fast_assign(
+		__entry->vm_start = vm_start;
+		__entry->none_or_zero = none_or_zero;
+		__entry->referenced = referenced;
+		__entry->writable = writable;
+	),
+
+	TP_printk("vm_start=%04lx, none_or_zero=%d, referenced=%d, writable=%d",
+		__entry->vm_start,
+		__entry->none_or_zero,
+		__entry->referenced,
+		__entry->writable)
+);
+
+#endif /* __HUGE_MEMORY_H */
+#include
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51..9bb97fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -29,6 +29,9 @@
 #include
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include
+
 /*
  * By default transparent hugepage support is disabled in order that avoid
  * to risk increase the memory footprint of applications without a guaranteed
@@ -2266,6 +2269,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	if (likely(referenced && writable))
 		return 1;
 out:
+	trace_mm_collapse_huge_page_isolate(vma->vm_start, none_or_zero,
+					    referenced, writable);
 	release_pte_pages(pte, _pte);
 	return 0;
 }
@@ -2501,7 +2506,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pgtable_t pgtable;
 	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
-	int isolated;
+	int isolated = 0;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
@@ -2619,6 +2624,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	khugepaged_pages_collapsed++;
 out_up_write:
 	up_write(&mm->mmap_sem);
+	trace_mm_collapse_huge_page(mm, vma->vm_start, isolated);
 	return;
 
 out:
@@ -2694,6 +2700,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		ret = 1;
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
+				     none_or_zero, ret);
 	if (ret) {
 		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_sem released */
--
1.9.1

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ebru Akagunduz
Subject: [RFC 2/3] mm: make optimistic check for swapin readahead
Date: Sun, 14 Jun 2015 18:04:42 +0300
Message-Id: <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
In-Reply-To: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com,
	vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com, Ebru Akagunduz

This patch makes optimistic check for swapin readahead
to increase thp
collapse rate. Before getting swapped
out pages to memory, checks them and allows up to a
certain number. It also prints out using tracepoints
amount of unmapped ptes.

Signed-off-by: Ebru Akagunduz
---
 include/trace/events/huge_memory.h | 11 +++++++----
 mm/huge_memory.c                   | 13 ++++++++++---
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4b9049b..53c9f2e 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -9,9 +9,9 @@
 TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 	TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
-		bool referenced, int none_or_zero, int collapse),
+		bool referenced, int none_or_zero, int collapse, int unmapped),
 
-	TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
+	TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
@@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 		__field(bool, referenced)
 		__field(int, none_or_zero)
 		__field(int, collapse)
+		__field(int, unmapped)
 	),
 
 	TP_fast_assign(
@@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 		__entry->referenced = referenced;
 		__entry->none_or_zero = none_or_zero;
 		__entry->collapse = collapse;
+		__entry->unmapped = unmapped;
 	),
 
-	TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
+	TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d",
 		__entry->mm,
 		__entry->vm_start,
 		__entry->writable,
 		__entry->referenced,
 		__entry->none_or_zero,
-		__entry->collapse)
+		__entry->collapse,
+		__entry->unmapped)
 );
 
 TRACE_EVENT(mm_collapse_huge_page,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9bb97fc..22bc0bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, none_or_zero = 0;
+	int ret = 0, none_or_zero = 0, unmapped = 0;
 	struct page *page;
 	unsigned long _address;
 	spinlock_t *ptl;
-	int node = NUMA_NO_NODE;
+	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
 	bool writable = false, referenced = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
+		if (is_swap_pte(pteval)) {
+			if (++unmapped <= max_ptes_swap)
+				continue;
+			else
+				goto out_unmap;
+		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			if (!userfaultfd_armed(vma) &&
 			    ++none_or_zero <= khugepaged_max_ptes_none)
@@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
-				     none_or_zero, ret);
+				     none_or_zero, ret, unmapped);
 	if (ret) {
 		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_sem released */
--
1.9.1

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ebru Akagunduz
Subject: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate
Date: Sun, 14 Jun 2015 18:04:43 +0300
Message-Id: <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com>
In-Reply-To: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com,
	vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com, Ebru Akagunduz

This patch makes swapin readahead to improve thp collapse rate.
When khugepaged scanned pages, there can be a few of the pages
in swap area.

With the patch THP can collapse 4kB pages into a THP when
there are up to max_ptes_swap swap ptes in a 2MB range.

The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. I force
the system to swap out all. Afterwards, the test program
touches the area by writing, it skips a page in each
20 pages of the area.

Without the patch, system did not swap in readahead.
THP rate was %47 of the program of the memory, it
did not change over time.

With this patch, after 10 minutes of waiting khugepaged had
collapsed %99 of the program's memory.

Signed-off-by: Ebru Akagunduz
---
I've written down test results:

With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:        470760 kB
AnonHugePages:    468992 kB
Swap:             329244 kB
Fraction:         %99

After swapped in:
In ten minutes:
cat /proc/pid/smaps:
Anonymous:        769208 kB
AnonHugePages:    765952 kB
Swap:              30796 kB
Fraction:         %99

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:        238160 kB
AnonHugePages:    235520 kB
Swap:             561844 kB
Fraction:         %98

After swapped in:
cat /proc/pid/smaps:
In ten minutes:
Anonymous:        499956 kB
AnonHugePages:    235520 kB
Swap:             300048 kB
Fraction:         %47

 include/linux/mm.h                 |  4 ++++
 include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
 mm/huge_memory.c                   | 35 +++++++++++++++++++++++++++++++++++
 mm/memory.c                        |  2 +-
 4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f47178..f66ff8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,10 @@ struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
 
+extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *page_table, pmd_t *pmd,
+			unsigned int flags, pte_t orig_pte);
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
 
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 53c9f2e..0117ab9 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -95,5 +95,29 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->writable)
 );
 
+TRACE_EVENT(mm_collapse_huge_page_swapin,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int swap_pte),
+
+	TP_ARGS(mm, vm_start, swap_pte),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, vm_start)
+		__field(int, swap_pte)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->vm_start = vm_start;
+		__entry->swap_pte = swap_pte;
+	),
+
+	TP_printk("mm=%p, vm_start=%04lx, swap_pte=%d",
+		__entry->mm,
+		__entry->vm_start,
+		__entry->swap_pte)
+);
+
 #endif /* __HUGE_MEMORY_H */
 #include
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 22bc0bf..cb3e82a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2496,6 +2496,39 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Bring missing pages in from swap, to complete THP collapse.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ *
+ * Called and returns without pte mapped or spinlocks held,
+ * but with mmap_sem held to protect against vma changes.
+ */
+
+static void __collapse_huge_page_swapin(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long address, pmd_t *pmd,
+					pte_t *pte)
+{
+	unsigned long _address;
+	pte_t pteval = *pte;
+	int swap_pte = 0;
+
+	pte = pte_offset_map(pmd, address);
+	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
+	     pte++, _address += PAGE_SIZE) {
+		pteval = *pte;
+		if (is_swap_pte(pteval)) {
+			swap_pte++;
+			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
+			/* pte is unmapped now, we need to map it */
+			pte = pte_offset_map(pmd, _address);
+		}
+	}
+	pte--;
+	pte_unmap(pte);
+	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
+}
+
 static void collapse_huge_page(struct mm_struct *mm,
 				   unsigned long address,
 				   struct page **hpage,
@@ -2551,6 +2584,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!pmd)
 		goto out;
 
+	__collapse_huge_page_swapin(mm, vma, address, pmd, pte);
+
 	anon_vma_lock_write(vma->anon_vma);
 
 	pte = pte_offset_map(pmd, address);
diff --git a/mm/memory.c b/mm/memory.c
index e1c45d0..d801dc5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2443,7 +2443,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
  * We return with the mmap_sem locked or unlocked in the same cases
  * as does filemap_fault().
  */
-static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		unsigned int flags, pte_t orig_pte)
 {
--
1.9.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <557E249B.6070208@redhat.com>
Date: Sun, 14 Jun 2015 21:04:27 -0400
From: Rik van Riel
Subject: Re: [RFC 1/3] mm: add tracepoint for scanning pages
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-2-git-send-email-ebru.akagunduz@gmail.com>
In-Reply-To: <1434294283-8699-2-git-send-email-ebru.akagunduz@gmail.com>
To: Ebru Akagunduz, linux-mm@kvack.org
Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com,
	vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com

On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> Using static
> tracepoints, data of functions is recorded.
> It is good to automatize debugging without doing a lot
> of changes in the source code.
>
> This patch adds tracepoint for khugepaged_scan_pmd,
> collapse_huge_page and __collapse_huge_page_isolate.

These trace points seem like a useful set to figure out
what the THP collapse code is doing.

> Signed-off-by: Ebru Akagunduz

Acked-by: Rik van Riel

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
In-Reply-To: <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
From: Leon Romanovsky
Date: Mon, 15 Jun 2015 08:40:03 +0300
Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead
To: Ebru Akagunduz
Cc: Linux-MM, Andrew Morton, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange, riel@redhat.com,
	iamjoonsoo.kim@lge.com, Xiexiuqi, gorcunov@openvz.org,
	"linux-kernel@vger.kernel.org", Mel Gorman, rientjes@google.com,
	Vlastimil
	Babka, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	Johannes Weiner, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com

On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz wrote:
> This patch makes optimistic check for swapin readahead
> to increase thp collapse rate. Before getting swapped
> out pages to memory, checks them and allows up to a
> certain number. It also prints out using tracepoints
> amount of unmapped ptes.
>
> Signed-off-by: Ebru Akagunduz
> ---
>  include/trace/events/huge_memory.h | 11 +++++++----
>  mm/huge_memory.c                   | 13 ++++++++++---
>  2 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4b9049b..53c9f2e 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -9,9 +9,9 @@
>  TRACE_EVENT(mm_khugepaged_scan_pmd,
>
>  	TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
> -		bool referenced, int none_or_zero, int collapse),
> +		bool referenced, int none_or_zero, int collapse, int unmapped),
>
> -	TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
> +	TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped),
>
>  	TP_STRUCT__entry(
>  		__field(struct mm_struct *, mm)
> @@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
>  		__field(bool, referenced)
>  		__field(int, none_or_zero)
>  		__field(int, collapse)
> +		__field(int, unmapped)
>  	),
>
>  	TP_fast_assign(
> @@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
>  		__entry->referenced = referenced;
>  		__entry->none_or_zero = none_or_zero;
>  		__entry->collapse = collapse;
> +		__entry->unmapped = unmapped;
>  	),
>
> -	TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
> +	TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d",
>  		__entry->mm,
>  		__entry->vm_start,
>  		__entry->writable,
>  		__entry->referenced,
>  		__entry->none_or_zero,
> -		__entry->collapse)
> +		__entry->collapse,
> +		__entry->unmapped)
>  );
>
>  TRACE_EVENT(mm_collapse_huge_page,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9bb97fc..22bc0bf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -24,6 +24,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> -	int ret = 0, none_or_zero = 0;
> +	int ret = 0, none_or_zero = 0, unmapped = 0;
>  	struct page *page;
>  	unsigned long _address;
>  	spinlock_t *ptl;
> -	int node = NUMA_NO_NODE;
> +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;

Sorry for asking, my knowledge of THP is very limited, but why did you
choose this default value?
From the discussion followed by your patch
(https://lkml.org/lkml/2015/2/27/432), I got an impression that it is
not necessary right value.

>  	bool writable = false, referenced = false;
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> @@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
>  	     _pte++, _address += PAGE_SIZE) {
>  		pte_t pteval = *_pte;
> +		if (is_swap_pte(pteval)) {
> +			if (++unmapped <= max_ptes_swap)
> +				continue;
> +			else
> +				goto out_unmap;
> +		}
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			if (!userfaultfd_armed(vma) &&
>  			    ++none_or_zero <= khugepaged_max_ptes_none)
> @@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
> -				     none_or_zero, ret);
> +				     none_or_zero, ret, unmapped);
>  	if (ret) {
>  		node = khugepaged_find_target_node();
>  		/* collapse_huge_page will return with the mmap_sem released */
> --
> 1.9.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.
> For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

--
Leon Romanovsky | Independent Linux Consultant
        www.leon.nu | leon@leon.nu

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <557E65E7.9010000@redhat.com>
Date: Mon, 15 Jun 2015 01:43:03 -0400
From: Rik van Riel
Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
To: Leon Romanovsky, Ebru Akagunduz
Cc: Linux-MM, Andrew Morton, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange, iamjoonsoo.kim@lge.com,
	Xiexiuqi, gorcunov@openvz.org, "linux-kernel@vger.kernel.org",
	Mel Gorman, rientjes@google.com, Vlastimil Babka,
	aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, Johannes Weiner,
	mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com

On 06/15/2015 01:40 AM, Leon Romanovsky wrote:
> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz wrote:
>> This patch makes optimistic check for swapin readahead
>> to increase thp collapse rate.
>> Before getting swapped
>> out pages to memory, checks them and allows up to a
>> certain number. It also prints out using tracepoints
>> amount of unmapped ptes.
>>
>> Signed-off-by: Ebru Akagunduz

>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>  {
>>  	pmd_t *pmd;
>>  	pte_t *pte, *_pte;
>> -	int ret = 0, none_or_zero = 0;
>> +	int ret = 0, none_or_zero = 0, unmapped = 0;
>>  	struct page *page;
>>  	unsigned long _address;
>>  	spinlock_t *ptl;
>> -	int node = NUMA_NO_NODE;
>> +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> Sorry for asking, my knowledge of THP is very limited, but why did you
> choose this default value?
> From the discussion followed by your patch
> (https://lkml.org/lkml/2015/2/27/432), I got an impression that it is
> not necessary right value.

I believe that Ebru's main focus for this initial version of
the patch series was to get the _mechanism_ (patch 3) right,
while having a fairly simple policy to drive it.

Any suggestions on when it is a good idea to bring in pages
from swap, and whether to treat resident-in-swap-cache pages
differently from need-to-be-paged-in pages, and what other
factors should be examined, are very welcome...

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
In-Reply-To: <557E65E7.9010000@redhat.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
	<557E65E7.9010000@redhat.com>
From: Leon Romanovsky
Date: Mon, 15 Jun 2015 09:08:02 +0300
Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead
To: Rik van Riel
Cc: Ebru Akagunduz, Linux-MM, Andrew Morton, "kirill.shutemov",
	n-horiguchi, aarcange, "iamjoonsoo.kim", Xiexiuqi, gorcunov,
	"linux-kernel@vger.kernel.org", Mel Gorman, rientjes,
	Vlastimil Babka, "aneesh.kumar", hughd, Johannes Weiner, mhocko,
	boaz, raindel

On Mon, Jun 15, 2015 at 8:43 AM, Rik van Riel wrote:
> On 06/15/2015 01:40 AM, Leon Romanovsky wrote:
>> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz wrote:
>>> This patch makes optimistic check for swapin readahead
>>> to increase thp collapse rate. Before getting swapped
>>> out pages to memory, checks them and allows up to a
>>> certain number. It also prints out using tracepoints
>>> amount of unmapped ptes.
>>>
>>> Signed-off-by: Ebru Akagunduz
>
>>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>  {
>>>  	pmd_t *pmd;
>>>  	pte_t *pte, *_pte;
>>> -	int ret = 0, none_or_zero = 0;
>>> +	int ret = 0, none_or_zero = 0, unmapped = 0;
>>>  	struct page *page;
>>>  	unsigned long _address;
>>>  	spinlock_t *ptl;
>>> -	int node = NUMA_NO_NODE;
>>> +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
>> Sorry for asking, my knowledge of THP is very limited, but why did you
>> choose this default value?
>> From the discussion followed by your patch >> (https://lkml.org/lkml/2015/2/27/432), I got an impression that it is >> not necessary right value. > > I believe that Ebru's main focus for this initial version of > the patch series was to get the _mechanism_ (patch 3) right, > while having a fairly simple policy to drive it. > > Any suggestions on when it is a good idea to bring in pages > from swap, and whether to treat resident-in-swap-cache pages > differently from need-to-be-paged-in pages, and what other > factors should be examined, are very welcome... My concern with these patches that they deal with specific load/scenario (most of the application returned back from swap). In scenario there only 10% of data will be required, it theoretically can bring upto 80% data (70% waste). > > -- > All rights reversed -- Leon Romanovsky | Independent Linux Consultant www.leon.nu | leon@leon.nu -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f171.google.com (mail-qk0-f171.google.com [209.85.220.171]) by kanga.kvack.org (Postfix) with ESMTP id 5F4CF6B0032 for ; Mon, 15 Jun 2015 02:35:45 -0400 (EDT) Received: by qkhu186 with SMTP id u186so1257226qkh.0 for ; Sun, 14 Jun 2015 23:35:45 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id z92si11850109qgd.1.2015.06.14.23.35.43 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 14 Jun 2015 23:35:44 -0700 (PDT) Message-ID: <557E7235.1090105@redhat.com> Date: Mon, 15 Jun 2015 02:35:33 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> <557E65E7.9010000@redhat.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Leon Romanovsky Cc: Ebru Akagunduz , Linux-MM , Andrew Morton , "kirill.shutemov" , n-horiguchi , aarcange , "iamjoonsoo.kim" , Xiexiuqi , gorcunov , "linux-kernel@vger.kernel.org" , Mel Gorman , rientjes , Vlastimil Babka , "aneesh.kumar" , hughd , Johannes Weiner , mhocko , boaz , raindel On 06/15/2015 02:08 AM, Leon Romanovsky wrote: > On Mon, Jun 15, 2015 at 8:43 AM, Rik van Riel wrote: >> On 06/15/2015 01:40 AM, Leon Romanovsky wrote: >>> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz >>> wrote: >>>> This patch makes optimistic check for swapin readahead >>>> to increase thp collapse rate. Before getting swapped >>>> out pages to memory, checks them and allows up to a >>>> certain number. It also prints out using tracepoints >>>> amount of unmapped ptes. >>>> >>>> Signed-off-by: Ebru Akagunduz >> >>>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, >>>> { >>>> pmd_t *pmd; >>>> pte_t *pte, *_pte; >>>> - int ret = 0, none_or_zero = 0; >>>> + int ret = 0, none_or_zero = 0, unmapped = 0; >>>> struct page *page; >>>> unsigned long _address; >>>> spinlock_t *ptl; >>>> - int node = NUMA_NO_NODE; >>>> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8; >>> Sorry for asking, my knoweldge of THP is very limited, but why did you >>> choose this default value? 
>>> From the discussion followed by your patch >>> (https://lkml.org/lkml/2015/2/27/432), I got an impression that it is >>> not necessary right value. >> >> I believe that Ebru's main focus for this initial version of >> the patch series was to get the _mechanism_ (patch 3) right, >> while having a fairly simple policy to drive it. >> >> Any suggestions on when it is a good idea to bring in pages >> from swap, and whether to treat resident-in-swap-cache pages >> differently from need-to-be-paged-in pages, and what other >> factors should be examined, are very welcome... > My concern with these patches that they deal with specific > load/scenario (most of the application returned back from swap). In > scenario there only 10% of data will be required, it theoretically can > bring upto 80% data (70% waste). The chosen threshold ensures that the remaining non-resident 4kB pages in a THP are only brought in if 7/8th (or 87.5%) of the pages are already resident. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f180.google.com (mail-qc0-f180.google.com [209.85.216.180]) by kanga.kvack.org (Postfix) with ESMTP id 69DAB6B006E for ; Mon, 15 Jun 2015 10:00:00 -0400 (EDT) Received: by qcsf5 with SMTP id f5so3600731qcs.2 for ; Mon, 15 Jun 2015 07:00:00 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTPS id m97si8115567qkh.28.2015.06.15.06.59.58 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 15 Jun 2015 06:59:59 -0700 (PDT) Message-ID: <557EDA53.3040906@redhat.com> Date: Mon, 15 Jun 2015 09:59:47 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> In-Reply-To: <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ebru Akagunduz , linux-mm@kvack.org Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com On 06/14/2015 11:04 AM, Ebru Akagunduz wrote: > This patch makes swapin readahead to improve thp collapse rate. > When khugepaged scanned pages, there can be a few of the pages > in swap area. > > With the patch THP can collapse 4kB pages into a THP when > there are up to max_ptes_swap swap ptes in a 2MB range. > > The patch was tested with a test program that allocates > 800MB of memory, writes to it, and then sleeps. I force > the system to swap out all. Afterwards, the test program > touches the area by writing, it skips a page in each > 20 pages of the area. > > Without the patch, system did not swap in readahead. > THP rate was %47 of the program of the memory, it > did not change over time. > > With this patch, after 10 minutes of waiting khugepaged had > collapsed %99 of the program's memory. 
> > Signed-off-by: Ebru Akagunduz Mechanism looks good to me. Acked-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f175.google.com (mail-qc0-f175.google.com [209.85.216.175]) by kanga.kvack.org (Postfix) with ESMTP id 9ADF26B006E for ; Mon, 15 Jun 2015 10:05:31 -0400 (EDT) Received: by qcsf5 with SMTP id f5so3663323qcs.2 for ; Mon, 15 Jun 2015 07:05:31 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id i8si9640141qgf.23.2015.06.15.07.05.30 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 15 Jun 2015 07:05:31 -0700 (PDT) Message-ID: <557EDBA2.9090308@redhat.com> Date: Mon, 15 Jun 2015 10:05:22 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> In-Reply-To: <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ebru Akagunduz , linux-mm@kvack.org Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com On 06/14/2015 11:04 AM, Ebru Akagunduz wrote: > This patch makes optimistic check for swapin readahead > to increase thp collapse rate. 
Before getting swapped
> out pages to memory, checks them and allows up to a
> certain number. It also prints out using tracepoints
> amount of unmapped ptes.
>
> Signed-off-by: Ebru Akagunduz

> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> -	int ret = 0, none_or_zero = 0;
> +	int ret = 0, none_or_zero = 0, unmapped = 0;
>  	struct page *page;
>  	unsigned long _address;
>  	spinlock_t *ptl;
> -	int node = NUMA_NO_NODE;
> +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
>  	bool writable = false, referenced = false;

This has the effect of only swapping in 4kB pages to form a THP
if 7/8th of the THP is already resident in memory.

This is a pretty conservative thing to do.

I am not sure if we would also need to take into account things
like these:
1) How many pages in the THP-area are recently referenced?
   Maybe this does not matter if 87.5% of the 4kB pages got
   faulted in after swap-out, anyway?
2) How much free memory does the system have?
   We don't test that for collapsing a THP with lots of
   pte_none() ptes, so not sure how much this matters...
3) How many of the pages we want to swap in are already resident
   in the swap cache?
   Not sure exactly what to do with this number...
4) other factors?

I am also not sure how we would determine such a policy, except
by maybe having these patches sit in -mm and -next for a few
cycles, and seeing what happens...

--
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 03D816B0071 for ; Mon, 15 Jun 2015 12:08:08 -0400 (EDT) Received: by wiwd19 with SMTP id d19so79895656wiw.0 for ; Mon, 15 Jun 2015 09:08:07 -0700 (PDT) Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com. [209.85.212.170]) by mx.google.com with ESMTPS id gb6si15528167wic.42.2015.06.15.09.08.06 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 15 Jun 2015 09:08:06 -0700 (PDT) Received: by wifx6 with SMTP id x6so83624981wif.0 for ; Mon, 15 Jun 2015 09:08:06 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <557EDBA2.9090308@redhat.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> <557EDBA2.9090308@redhat.com> From: Leon Romanovsky Date: Mon, 15 Jun 2015 19:07:45 +0300 Message-ID: Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Ebru Akagunduz , Linux-MM , Andrew Morton , "kirill.shutemov" , n-horiguchi , aarcange , "iamjoonsoo.kim" , Xiexiuqi , gorcunov , "linux-kernel@vger.kernel.org" , Mel Gorman , rientjes , Vlastimil Babka , "aneesh.kumar" , Hugh Dickins , Johannes Weiner , mhocko , boaz , raindel On Mon, Jun 15, 2015 at 5:05 PM, Rik van Riel wrote: > > On 06/14/2015 11:04 AM, Ebru Akagunduz wrote: > > This patch makes optimistic check for swapin readahead > > to increase thp collapse rate. Before getting swapped > > out pages to memory, checks them and allows up to a > > certain number. It also prints out using tracepoints > > amount of unmapped ptes. 
> >
> > Signed-off-by: Ebru Akagunduz
>
> > @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  {
> >  	pmd_t *pmd;
> >  	pte_t *pte, *_pte;
> > -	int ret = 0, none_or_zero = 0;
> > +	int ret = 0, none_or_zero = 0, unmapped = 0;
> >  	struct page *page;
> >  	unsigned long _address;
> >  	spinlock_t *ptl;
> > -	int node = NUMA_NO_NODE;
> > +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> >  	bool writable = false, referenced = false;
>
> This has the effect of only swapping in 4kB pages to form a THP
> if 7/8th of the THP is already resident in memory.

Thanks for clarifying it to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pd0-f181.google.com (mail-pd0-f181.google.com [209.85.192.181])
	by kanga.kvack.org (Postfix) with ESMTP id D2E3D6B0038
	for ; Tue, 16 Jun 2015 17:15:42 -0400 (EDT)
Received: by pdjn11 with SMTP id n11so22571955pdj.0
	for ; Tue, 16 Jun 2015 14:15:42 -0700 (PDT)
Received: from mail.linuxfoundation.org (mail.linuxfoundation.org.
[140.211.169.12]) by mx.google.com with ESMTPS id tv5si2887991pbc.226.2015.06.16.14.15.41 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 16 Jun 2015 14:15:41 -0700 (PDT) Date: Tue, 16 Jun 2015 14:15:40 -0700 From: Andrew Morton Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate Message-Id: <20150616141540.adc40130139151bf19f07ff9@linux-foundation.org> In-Reply-To: <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ebru Akagunduz Cc: linux-mm@kvack.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz wrote: > This patch makes swapin readahead to improve thp collapse rate. > When khugepaged scanned pages, there can be a few of the pages > in swap area. > > With the patch THP can collapse 4kB pages into a THP when > there are up to max_ptes_swap swap ptes in a 2MB range. > > The patch was tested with a test program that allocates > 800MB of memory, writes to it, and then sleeps. I force > the system to swap out all. Afterwards, the test program > touches the area by writing, it skips a page in each > 20 pages of the area. > > Without the patch, system did not swap in readahead. > THP rate was %47 of the program of the memory, it > did not change over time. > > With this patch, after 10 minutes of waiting khugepaged had > collapsed %99 of the program's memory. > > ... 
>
> +/*
> + * Bring missing pages in from swap, to complete THP collapse.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + *
> + * Called and returns without pte mapped or spinlocks held,
> + * but with mmap_sem held to protect against vma changes.
> + */
> +
> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address, pmd_t *pmd,
> +					pte_t *pte)
> +{
> +	unsigned long _address;
> +	pte_t pteval = *pte;
> +	int swap_pte = 0;
> +
> +	pte = pte_offset_map(pmd, address);
> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> +	     pte++, _address += PAGE_SIZE) {
> +		pteval = *pte;
> +		if (is_swap_pte(pteval)) {
> +			swap_pte++;
> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
> +			/* pte is unmapped now, we need to map it */
> +			pte = pte_offset_map(pmd, _address);
> +		}
> +	}
> +	pte--;
> +	pte_unmap(pte);
> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> +}

This is doing a series of synchronous reads.  That will be sloooow on
spinning disks.

This function should be significantly faster if it first gets all the
necessary I/O underway.  I don't think we have a function which exactly
does this.  Perhaps generalise swapin_readahead() or open-code
something like

	blk_start_plug(...);
	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
			pte++, _address += PAGE_SIZE) {
		if (is_swap_pte(*pte)) {
			read_swap_cache_async(...);
		}
	}
	blk_finish_plug(...);

If you do make a change such as this, please benchmark its effects.  Not
on SSD ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f52.google.com (mail-qg0-f52.google.com [209.85.192.52]) by kanga.kvack.org (Postfix) with ESMTP id 511DD6B0032 for ; Tue, 16 Jun 2015 23:20:32 -0400 (EDT) Received: by qgeu36 with SMTP id u36so11583017qge.2 for ; Tue, 16 Jun 2015 20:20:32 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id r14si3086658qha.35.2015.06.16.20.20.31 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 16 Jun 2015 20:20:31 -0700 (PDT) Message-ID: <5580E774.3070307@redhat.com> Date: Tue, 16 Jun 2015 23:20:20 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> <20150616141540.adc40130139151bf19f07ff9@linux-foundation.org> In-Reply-To: <20150616141540.adc40130139151bf19f07ff9@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Ebru Akagunduz Cc: linux-mm@kvack.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com On 06/16/2015 05:15 PM, Andrew Morton wrote: > On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz wrote: > >> This patch makes swapin readahead to improve thp collapse rate. >> When khugepaged scanned pages, there can be a few of the pages >> in swap area. >> >> With the patch THP can collapse 4kB pages into a THP when >> there are up to max_ptes_swap swap ptes in a 2MB range. 
>> >> The patch was tested with a test program that allocates >> 800MB of memory, writes to it, and then sleeps. I force >> the system to swap out all. Afterwards, the test program >> touches the area by writing, it skips a page in each >> 20 pages of the area. >> >> Without the patch, system did not swap in readahead. >> THP rate was %47 of the program of the memory, it >> did not change over time. >> >> With this patch, after 10 minutes of waiting khugepaged had >> collapsed %99 of the program's memory. >> >> ... >> >> +/* >> + * Bring missing pages in from swap, to complete THP collapse. >> + * Only done if khugepaged_scan_pmd believes it is worthwhile. >> + * >> + * Called and returns without pte mapped or spinlocks held, >> + * but with mmap_sem held to protect against vma changes. >> + */ >> + >> +static void __collapse_huge_page_swapin(struct mm_struct *mm, >> + struct vm_area_struct *vma, >> + unsigned long address, pmd_t *pmd, >> + pte_t *pte) >> +{ >> + unsigned long _address; >> + pte_t pteval = *pte; >> + int swap_pte = 0; >> + >> + pte = pte_offset_map(pmd, address); >> + for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE; >> + pte++, _address += PAGE_SIZE) { >> + pteval = *pte; >> + if (is_swap_pte(pteval)) { >> + swap_pte++; >> + do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval); >> + /* pte is unmapped now, we need to map it */ >> + pte = pte_offset_map(pmd, _address); >> + } >> + } >> + pte--; >> + pte_unmap(pte); >> + trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte); >> +} > > This is doing a series of synchronous reads. That will be sloooow on > spinning disks. > > This function should be significantly faster if it first gets all the > necessary I/O underway. I don't think we have a function which exactly > does this. Perhaps generalise swapin_readahead() or open-code > something like Looking at do_swap_page() and __lock_page_or_retry(), I guess there already is a way to do the above. 
Passing a "flags" of FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT to do_swap_page() should result in do_swap_page() returning with the pte unmapped and the mmap_sem still held if the page was not immediately available to map into the pte (trylock_page succeeds). Ebru, can you try passing the above as the flags argument to do_swap_page(), and see what happens? -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f54.google.com (mail-wg0-f54.google.com [74.125.82.54]) by kanga.kvack.org (Postfix) with ESMTP id 77F8E6B0072 for ; Wed, 17 Jun 2015 13:39:02 -0400 (EDT) Received: by wgv5 with SMTP id 5so43084529wgv.1 for ; Wed, 17 Jun 2015 10:39:02 -0700 (PDT) Received: from mail-wg0-x229.google.com (mail-wg0-x229.google.com. [2a00:1450:400c:c00::229]) by mx.google.com with ESMTPS id la1si9110811wjc.209.2015.06.17.10.39.00 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 17 Jun 2015 10:39:01 -0700 (PDT) Received: by wgv5 with SMTP id 5so43083918wgv.1 for ; Wed, 17 Jun 2015 10:39:00 -0700 (PDT) Date: Wed, 17 Jun 2015 20:38:56 +0300 From: Ebru Akagunduz Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate Message-ID: <20150617173856.GA3970@debian> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> <20150616141540.adc40130139151bf19f07ff9@linux-foundation.org> <5580E774.3070307@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5580E774.3070307@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: linux-mm@kvack.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com, 
xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com On Tue, Jun 16, 2015 at 11:20:20PM -0400, Rik van Riel wrote: > On 06/16/2015 05:15 PM, Andrew Morton wrote: > > On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz wrote: > > > >> This patch makes swapin readahead to improve thp collapse rate. > >> When khugepaged scanned pages, there can be a few of the pages > >> in swap area. > >> > >> With the patch THP can collapse 4kB pages into a THP when > >> there are up to max_ptes_swap swap ptes in a 2MB range. > >> > >> The patch was tested with a test program that allocates > >> 800MB of memory, writes to it, and then sleeps. I force > >> the system to swap out all. Afterwards, the test program > >> touches the area by writing, it skips a page in each > >> 20 pages of the area. > >> > >> Without the patch, system did not swap in readahead. > >> THP rate was %47 of the program of the memory, it > >> did not change over time. > >> > >> With this patch, after 10 minutes of waiting khugepaged had > >> collapsed %99 of the program's memory. > >> > >> ... > >> > >> +/* > >> + * Bring missing pages in from swap, to complete THP collapse. > >> + * Only done if khugepaged_scan_pmd believes it is worthwhile. > >> + * > >> + * Called and returns without pte mapped or spinlocks held, > >> + * but with mmap_sem held to protect against vma changes. 
> >> + */
> >> +
> >> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> >> +					struct vm_area_struct *vma,
> >> +					unsigned long address, pmd_t *pmd,
> >> +					pte_t *pte)
> >> +{
> >> +	unsigned long _address;
> >> +	pte_t pteval = *pte;
> >> +	int swap_pte = 0;
> >> +
> >> +	pte = pte_offset_map(pmd, address);
> >> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> >> +	     pte++, _address += PAGE_SIZE) {
> >> +		pteval = *pte;
> >> +		if (is_swap_pte(pteval)) {
> >> +			swap_pte++;
> >> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
> >> +			/* pte is unmapped now, we need to map it */
> >> +			pte = pte_offset_map(pmd, _address);
> >> +		}
> >> +	}
> >> +	pte--;
> >> +	pte_unmap(pte);
> >> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> >> +}
> >
> > This is doing a series of synchronous reads.  That will be sloooow on
> > spinning disks.
> >
> > This function should be significantly faster if it first gets all the
> > necessary I/O underway.  I don't think we have a function which exactly
> > does this.  Perhaps generalise swapin_readahead() or open-code
> > something like
>
> Looking at do_swap_page() and __lock_page_or_retry(), I guess
> there already is a way to do the above.
>
> Passing a "flags" of FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT
> to do_swap_page() should result in do_swap_page() returning with
> the pte unmapped and the mmap_sem still held if the page was not
> immediately available to map into the pte (trylock_page succeeds).
>
> Ebru, can you try passing the above as the flags argument to
> do_swap_page(), and see what happens?

I will try and resend the patch series.
Thanks for the suggestions.

Ebru

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753154AbbFNPFI (ORCPT ); Sun, 14 Jun 2015 11:05:08 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:36851 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751615AbbFNPFA (ORCPT ); Sun, 14 Jun 2015 11:05:00 -0400 From: Ebru Akagunduz To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com, Ebru Akagunduz Subject: [RFC 0/3] mm: make swapin readahead to gain more thp performance Date: Sun, 14 Jun 2015 18:04:40 +0300 Message-Id: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> X-Mailer: git-send-email 1.9.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch series makes swapin readahead up to a certain number to gain more thp performance and adds tracepoint for khugepaged_scan_pmd, collapse_huge_page, __collapse_huge_page_isolate. This patch series was written to deal with programs that access most, but not all, of their memory after they get swapped out. Currently these programs do not get their memory collapsed into THPs after the system swapped their memory out, while they would get THPs before swapping happened. This patch series was tested with a test program, it allocates 800MB of memory, writes to it, and then sleeps. I force the system to swap out all. Afterwards, the test program touches the area by writing and leaves a piece of it without writing. This shows how much swap in readahead made by the patch. 
I've written down test results: With the patch: After swapped out: cat /proc/pid/smaps: Anonymous: 470760 kB AnonHugePages: 468992 kB Swap: 329244 kB Fraction: %99 After swapped in: In ten minutes: cat /proc/pid/smaps: Anonymous: 769208 kB AnonHugePages: 765952 kB Swap: 30796 kB Fraction: %99 Without the patch: After swapped out: cat /proc/pid/smaps: Anonymous: 238160 kB AnonHugePages: 235520 kB Swap: 561844 kB Fraction: %98 After swapped in: cat /proc/pid/smaps: In ten minutes: Anonymous: 499956 kB AnonHugePages: 235520 kB Swap: 300048 kB Fraction: %47 Ebru Akagunduz (3): mm: add tracepoint for scanning pages mm: make optimistic check for swapin readahead mm: make swapin readahead to improve thp collapse rate include/linux/mm.h | 4 ++ include/trace/events/huge_memory.h | 123 +++++++++++++++++++++++++++++++++++++ mm/huge_memory.c | 56 ++++++++++++++++- mm/memory.c | 2 +- 4 files changed, 181 insertions(+), 4 deletions(-) create mode 100644 include/trace/events/huge_memory.h -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753375AbbFNPF0 (ORCPT ); Sun, 14 Jun 2015 11:05:26 -0400 Received: from mail-wg0-f53.google.com ([74.125.82.53]:33420 "EHLO mail-wg0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753051AbbFNPFD (ORCPT ); Sun, 14 Jun 2015 11:05:03 -0400 From: Ebru Akagunduz To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com, Ebru Akagunduz Subject: [RFC 1/3] mm: add tracepoint for scanning pages Date: Sun, 14 Jun 2015 18:04:41 +0300 Message-Id: 
<1434294283-8699-2-git-send-email-ebru.akagunduz@gmail.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Using static tracepoints, data of functions is recorded. It is good to automatize debugging without doing a lot of changes in the source code. This patch adds tracepoint for khugepaged_scan_pmd, collapse_huge_page and __collapse_huge_page_isolate. Signed-off-by: Ebru Akagunduz --- include/trace/events/huge_memory.h | 96 ++++++++++++++++++++++++++++++++++++++ mm/huge_memory.c | 10 +++- 2 files changed, 105 insertions(+), 1 deletion(-) create mode 100644 include/trace/events/huge_memory.h diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h new file mode 100644 index 0000000..4b9049b --- /dev/null +++ b/include/trace/events/huge_memory.h @@ -0,0 +1,96 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM huge_memory + +#if !defined(__HUGE_MEMORY_H) || defined(TRACE_HEADER_MULTI_READ) +#define __HUGE_MEMORY_H + +#include + +TRACE_EVENT(mm_khugepaged_scan_pmd, + + TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable, + bool referenced, int none_or_zero, int collapse), + + TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(unsigned long, vm_start) + __field(bool, writable) + __field(bool, referenced) + __field(int, none_or_zero) + __field(int, collapse) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->vm_start = vm_start; + __entry->writable = writable; + __entry->referenced = referenced; + __entry->none_or_zero = none_or_zero; + __entry->collapse = collapse; + ), + + TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d", + __entry->mm, + __entry->vm_start, + __entry->writable, 
+ __entry->referenced, + __entry->none_or_zero, + __entry->collapse) +); + +TRACE_EVENT(mm_collapse_huge_page, + + TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int isolated), + + TP_ARGS(mm, vm_start, isolated), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(unsigned long, vm_start) + __field(int, isolated) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->vm_start = vm_start; + __entry->isolated = isolated; + ), + + TP_printk("mm=%p, vm_start=%04lx, isolated=%d", + __entry->mm, + __entry->vm_start, + __entry->isolated) +); + +TRACE_EVENT(mm_collapse_huge_page_isolate, + + TP_PROTO(unsigned long vm_start, int none_or_zero, + bool referenced, bool writable), + + TP_ARGS(vm_start, none_or_zero, referenced, writable), + + TP_STRUCT__entry( + __field(unsigned long, vm_start) + __field(int, none_or_zero) + __field(bool, referenced) + __field(bool, writable) + ), + + TP_fast_assign( + __entry->vm_start = vm_start; + __entry->none_or_zero = none_or_zero; + __entry->referenced = referenced; + __entry->writable = writable; + ), + + TP_printk("vm_start=%04lx, none_or_zero=%d, referenced=%d, writable=%d", + __entry->vm_start, + __entry->none_or_zero, + __entry->referenced, + __entry->writable) +); + +#endif /* __HUGE_MEMORY_H */ +#include diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9671f51..9bb97fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -29,6 +29,9 @@ #include #include "internal.h" +#define CREATE_TRACE_POINTS +#include + /* * By default transparent hugepage support is disabled in order that avoid * to risk increase the memory footprint of applications without a guaranteed @@ -2266,6 +2269,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, if (likely(referenced && writable)) return 1; out: + trace_mm_collapse_huge_page_isolate(vma->vm_start, none_or_zero, + referenced, writable); release_pte_pages(pte, _pte); return 0; } @@ -2501,7 +2506,7 @@ static void collapse_huge_page(struct mm_struct 
*mm, pgtable_t pgtable; struct page *new_page; spinlock_t *pmd_ptl, *pte_ptl; - int isolated; + int isolated = 0; unsigned long hstart, hend; struct mem_cgroup *memcg; unsigned long mmun_start; /* For mmu_notifiers */ @@ -2619,6 +2624,7 @@ static void collapse_huge_page(struct mm_struct *mm, khugepaged_pages_collapsed++; out_up_write: up_write(&mm->mmap_sem); + trace_mm_collapse_huge_page(mm, vma->vm_start, isolated); return; out: @@ -2694,6 +2700,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, ret = 1; out_unmap: pte_unmap_unlock(pte, ptl); + trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced, + none_or_zero, ret); if (ret) { node = khugepaged_find_target_node(); /* collapse_huge_page will return with the mmap_sem released */ -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753290AbbFNPFV (ORCPT ); Sun, 14 Jun 2015 11:05:21 -0400 Received: from mail-wi0-f170.google.com ([209.85.212.170]:33576 "EHLO mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753136AbbFNPFG (ORCPT ); Sun, 14 Jun 2015 11:05:06 -0400 From: Ebru Akagunduz To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com, Ebru Akagunduz Subject: [RFC 2/3] mm: make optimistic check for swapin readahead Date: Sun, 14 Jun 2015 18:04:42 +0300 Message-Id: <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch adds an optimistic check for swapin readahead to increase the THP collapse rate. Before swapped-out pages are brought back into memory, khugepaged_scan_pmd counts the swap ptes in the range and lets the scan continue only while that count stays at or below max_ptes_swap (HPAGE_PMD_NR/8). It also reports the number of unmapped (swap) ptes through the mm_khugepaged_scan_pmd tracepoint. Signed-off-by: Ebru Akagunduz --- include/trace/events/huge_memory.h | 11 +++++++---- mm/huge_memory.c | 13 ++++++++++--- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 4b9049b..53c9f2e 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -9,9 +9,9 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable, - bool referenced, int none_or_zero, int collapse), + bool referenced, int none_or_zero, int collapse, int unmapped), - TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse), + TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped), TP_STRUCT__entry( __field(struct mm_struct *, mm) @@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, __field(bool, referenced) __field(int, none_or_zero) __field(int, collapse) + __field(int, unmapped) ), TP_fast_assign( @@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, __entry->referenced = referenced; __entry->none_or_zero = none_or_zero; __entry->collapse = collapse; + __entry->unmapped = unmapped; ), - TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d", + TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d", __entry->mm, __entry->vm_start, __entry->writable, __entry->referenced, __entry->none_or_zero, - __entry->collapse) + __entry->collapse, + __entry->unmapped) ); TRACE_EVENT(mm_collapse_huge_page, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9bb97fc..22bc0bf 100644 ---
a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, { pmd_t *pmd; pte_t *pte, *_pte; - int ret = 0, none_or_zero = 0; + int ret = 0, none_or_zero = 0, unmapped = 0; struct page *page; unsigned long _address; spinlock_t *ptl; - int node = NUMA_NO_NODE; + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8; bool writable = false, referenced = false; VM_BUG_ON(address & ~HPAGE_PMD_MASK); @@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { pte_t pteval = *_pte; + if (is_swap_pte(pteval)) { + if (++unmapped <= max_ptes_swap) + continue; + else + goto out_unmap; + } if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { if (!userfaultfd_armed(vma) && ++none_or_zero <= khugepaged_max_ptes_none) @@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, out_unmap: pte_unmap_unlock(pte, ptl); trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced, - none_or_zero, ret); + none_or_zero, ret, unmapped); if (ret) { node = khugepaged_find_target_node(); /* collapse_huge_page will return with the mmap_sem released */ -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753435AbbFNPFf (ORCPT ); Sun, 14 Jun 2015 11:05:35 -0400 Received: from mail-wi0-f177.google.com ([209.85.212.177]:38615 "EHLO mail-wi0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751615AbbFNPFJ (ORCPT ); Sun, 14 Jun 2015 11:05:09 -0400 From: Ebru Akagunduz To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, 
mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com, Ebru Akagunduz Subject: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate Date: Sun, 14 Jun 2015 18:04:43 +0300 Message-Id: <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> X-Mailer: git-send-email 1.9.1 In-Reply-To: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch makes khugepaged perform swapin readahead to improve the THP collapse rate. When khugepaged scans a range, a few of its pages may be in the swap area. With the patch, khugepaged can collapse 4kB pages into a THP when there are up to max_ptes_swap swap ptes in a 2MB range. The patch was tested with a test program that allocates 800MB of memory, writes to it, and then sleeps. I force the system to swap out all of it. Afterwards, the test program touches the area by writing, skipping one page in every 20 pages of the area. Without the patch, the system did no swapin readahead. The THP rate was 47% of the program's memory and did not change over time. With this patch, after 10 minutes of waiting khugepaged had collapsed 99% of the program's memory.
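The test program itself is not part of the posting. A minimal sketch of the access pattern it describes could look like the following, assuming anonymous memory obtained with mmap(); the helper names touch_skipping and alloc_and_fill are hypothetical and chosen only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not taken from the posted test program: write one
 * byte to every page of buf except each skip_every-th page, and return
 * the number of pages that were written.
 */
static size_t touch_skipping(char *buf, size_t len, size_t page_size,
			     size_t skip_every)
{
	size_t touched = 0;

	assert(buf != NULL && page_size > 0 && skip_every > 0);
	for (size_t off = 0, idx = 0; off + page_size <= len;
	     off += page_size, idx++) {
		if ((idx + 1) % skip_every == 0)
			continue;	/* leave this page untouched (still in swap) */
		buf[off] = 1;
		touched++;
	}
	return touched;
}

/* Allocate len bytes of anonymous memory and dirty all of it. */
static char *alloc_and_fill(size_t len)
{
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	memset(buf, 1, len);
	return buf;
}
```

A full run in the spirit of the description would call alloc_and_fill(800UL << 20), sleep while the system is pushed into swap, and then call touch_skipping(buf, len, sysconf(_SC_PAGESIZE), 20) before reading /proc/pid/smaps.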
Signed-off-by: Ebru Akagunduz --- I've written down test results: With the patch: After swapped out: cat /proc/pid/smaps: Anonymous: 470760 kB AnonHugePages: 468992 kB Swap: 329244 kB Fraction: %99 After swapped in: In ten minutes: cat /proc/pid/smaps: Anonymous: 769208 kB AnonHugePages: 765952 kB Swap: 30796 kB Fraction: %99 Without the patch: After swapped out: cat /proc/pid/smaps: Anonymous: 238160 kB AnonHugePages: 235520 kB Swap: 561844 kB Fraction: %98 After swapped in: cat /proc/pid/smaps: In ten minutes: Anonymous: 499956 kB AnonHugePages: 235520 kB Swap: 300048 kB Fraction: %47 include/linux/mm.h | 4 ++++ include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++ mm/huge_memory.c | 35 +++++++++++++++++++++++++++++++++++ mm/memory.c | 2 +- 4 files changed, 64 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7f47178..f66ff8a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -29,6 +29,10 @@ struct user_struct; struct writeback_control; struct bdi_writeback; +extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *page_table, pmd_t *pmd, + unsigned int flags, pte_t orig_pte); + #ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */ extern unsigned long max_mapnr; diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 53c9f2e..0117ab9 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -95,5 +95,29 @@ TRACE_EVENT(mm_collapse_huge_page_isolate, __entry->writable) ); +TRACE_EVENT(mm_collapse_huge_page_swapin, + + TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int swap_pte), + + TP_ARGS(mm, vm_start, swap_pte), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(unsigned long, vm_start) + __field(int, swap_pte) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->vm_start = vm_start; + __entry->swap_pte = swap_pte; + ), + + TP_printk("mm=%p, 
vm_start=%04lx, swap_pte=%d", + __entry->mm, + __entry->vm_start, + __entry->swap_pte) +); + #endif /* __HUGE_MEMORY_H */ #include diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 22bc0bf..cb3e82a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2496,6 +2496,39 @@ static bool hugepage_vma_check(struct vm_area_struct *vma) return true; } +/* + * Bring missing pages in from swap, to complete THP collapse. + * Only done if khugepaged_scan_pmd believes it is worthwhile. + * + * Called and returns without pte mapped or spinlocks held, + * but with mmap_sem held to protect against vma changes. + */ + +static void __collapse_huge_page_swapin(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + pte_t *pte) +{ + unsigned long _address; + pte_t pteval = *pte; + int swap_pte = 0; + + pte = pte_offset_map(pmd, address); + for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE; + pte++, _address += PAGE_SIZE) { + pteval = *pte; + if (is_swap_pte(pteval)) { + swap_pte++; + do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval); + /* pte is unmapped now, we need to map it */ + pte = pte_offset_map(pmd, _address); + } + } + pte--; + pte_unmap(pte); + trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte); +} + static void collapse_huge_page(struct mm_struct *mm, unsigned long address, struct page **hpage, @@ -2551,6 +2584,8 @@ static void collapse_huge_page(struct mm_struct *mm, if (!pmd) goto out; + __collapse_huge_page_swapin(mm, vma, address, pmd, pte); + anon_vma_lock_write(vma->anon_vma); pte = pte_offset_map(pmd, address); diff --git a/mm/memory.c b/mm/memory.c index e1c45d0..d801dc5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2443,7 +2443,7 @@ EXPORT_SYMBOL(unmap_mapping_range); * We return with the mmap_sem locked or unlocked in the same cases * as does filemap_fault(). 
*/ -static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, +int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, unsigned int flags, pte_t orig_pte) { -- 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753307AbbFOBEq (ORCPT ); Sun, 14 Jun 2015 21:04:46 -0400 Received: from mx1.redhat.com ([209.132.183.28]:41102 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752343AbbFOBEj (ORCPT ); Sun, 14 Jun 2015 21:04:39 -0400 Message-ID: <557E249B.6070208@redhat.com> Date: Sun, 14 Jun 2015 21:04:27 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Ebru Akagunduz , linux-mm@kvack.org CC: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com Subject: Re: [RFC 1/3] mm: add tracepoint for scanning pages References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-2-git-send-email-ebru.akagunduz@gmail.com> In-Reply-To: <1434294283-8699-2-git-send-email-ebru.akagunduz@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/14/2015 11:04 AM, Ebru Akagunduz wrote: > Using static tracepoints, data of functions is recorded. > It is good to automatize debugging without doing a lot > of changes in the source code. > > This patch adds tracepoint for khugepaged_scan_pmd, > collapse_huge_page and __collapse_huge_page_isolate. 
These trace points seem like a useful set to figure out what the THP collapse code is doing. > Signed-off-by: Ebru Akagunduz Acked-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753958AbbFOFke (ORCPT ); Mon, 15 Jun 2015 01:40:34 -0400 Received: from mail-wg0-f41.google.com ([74.125.82.41]:33821 "EHLO mail-wg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751386AbbFOFkZ (ORCPT ); Mon, 15 Jun 2015 01:40:25 -0400 MIME-Version: 1.0 X-Originating-IP: [212.25.79.130] In-Reply-To: <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> From: Leon Romanovsky Date: Mon, 15 Jun 2015 08:40:03 +0300 Message-ID: Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead To: Ebru Akagunduz Cc: Linux-MM , Andrew Morton , kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange , riel@redhat.com, iamjoonsoo.kim@lge.com, Xiexiuqi , gorcunov@openvz.org, "linux-kernel@vger.kernel.org" , Mel Gorman , rientjes@google.com, Vlastimil Babka , aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, Johannes Weiner , mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz wrote: > This patch makes optimistic check for swapin readahead > to increase thp collapse rate. Before getting swapped > out pages to memory, checks them and allows up to a > certain number. It also prints out using tracepoints > amount of unmapped ptes. 
> > Signed-off-by: Ebru Akagunduz > --- > include/trace/events/huge_memory.h | 11 +++++++---- > mm/huge_memory.c | 13 ++++++++++--- > 2 files changed, 17 insertions(+), 7 deletions(-) > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h > index 4b9049b..53c9f2e 100644 > --- a/include/trace/events/huge_memory.h > +++ b/include/trace/events/huge_memory.h > @@ -9,9 +9,9 @@ > TRACE_EVENT(mm_khugepaged_scan_pmd, > > TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable, > - bool referenced, int none_or_zero, int collapse), > + bool referenced, int none_or_zero, int collapse, int unmapped), > > - TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse), > + TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped), > > TP_STRUCT__entry( > __field(struct mm_struct *, mm) > @@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, > __field(bool, referenced) > __field(int, none_or_zero) > __field(int, collapse) > + __field(int, unmapped) > ), > > TP_fast_assign( > @@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, > __entry->referenced = referenced; > __entry->none_or_zero = none_or_zero; > __entry->collapse = collapse; > + __entry->unmapped = unmapped; > ), > > - TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d", > + TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d", > __entry->mm, > __entry->vm_start, > __entry->writable, > __entry->referenced, > __entry->none_or_zero, > - __entry->collapse) > + __entry->collapse, > + __entry->unmapped) > ); > > TRACE_EVENT(mm_collapse_huge_page, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 9bb97fc..22bc0bf 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -24,6 +24,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > { > pmd_t *pmd; > 
pte_t *pte, *_pte; > - int ret = 0, none_or_zero = 0; > + int ret = 0, none_or_zero = 0, unmapped = 0; > struct page *page; > unsigned long _address; > spinlock_t *ptl; > - int node = NUMA_NO_NODE; > + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8; Sorry for asking, my knowledge of THP is very limited, but why did you choose this default value? From the discussion that followed your patch (https://lkml.org/lkml/2015/2/27/432), I got the impression that it is not necessarily the right value. > bool writable = false, referenced = false; > > VM_BUG_ON(address & ~HPAGE_PMD_MASK); > @@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; > _pte++, _address += PAGE_SIZE) { > pte_t pteval = *_pte; > + if (is_swap_pte(pteval)) { > + if (++unmapped <= max_ptes_swap) > + continue; > + else > + goto out_unmap; > + } > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { > if (!userfaultfd_armed(vma) && > ++none_or_zero <= khugepaged_max_ptes_none) > @@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, > out_unmap: > pte_unmap_unlock(pte, ptl); > trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced, > - none_or_zero, ret); > + none_or_zero, ret, unmapped); > if (ret) { > node = khugepaged_find_target_node(); > /* collapse_huge_page will return with the mmap_sem released */ > -- > 1.9.1
-- Leon Romanovsky | Independent Linux Consultant www.leon.nu | leon@leon.nu From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753979AbbFOFnN (ORCPT ); Mon, 15 Jun 2015 01:43:13 -0400 Received: from mx1.redhat.com ([209.132.183.28]:41670 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751386AbbFOFnM (ORCPT ); Mon, 15 Jun 2015 01:43:12 -0400 Message-ID: <557E65E7.9010000@redhat.com> Date: Mon, 15 Jun 2015 01:43:03 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Leon Romanovsky , Ebru Akagunduz CC: Linux-MM , Andrew Morton , kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange , iamjoonsoo.kim@lge.com, Xiexiuqi , gorcunov@openvz.org, "linux-kernel@vger.kernel.org" , Mel Gorman , rientjes@google.com, Vlastimil Babka , aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, Johannes Weiner , mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/15/2015 01:40 AM, Leon Romanovsky wrote: > On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz > wrote: >> This patch makes optimistic check for swapin readahead >> to increase thp collapse rate. Before getting swapped >> out pages to memory, checks them and allows up to a >> certain number. It also prints out using tracepoints >> amount of unmapped ptes.
>> >> Signed-off-by: Ebru Akagunduz >> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, >> { >> pmd_t *pmd; >> pte_t *pte, *_pte; >> - int ret = 0, none_or_zero = 0; >> + int ret = 0, none_or_zero = 0, unmapped = 0; >> struct page *page; >> unsigned long _address; >> spinlock_t *ptl; >> - int node = NUMA_NO_NODE; >> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8; > Sorry for asking, my knoweldge of THP is very limited, but why did you > choose this default value? > From the discussion followed by your patch > (https://lkml.org/lkml/2015/2/27/432), I got an impression that it is > not necessary right value. I believe that Ebru's main focus for this initial version of the patch series was to get the _mechanism_ (patch 3) right, while having a fairly simple policy to drive it. Any suggestions on when it is a good idea to bring in pages from swap, and whether to treat resident-in-swap-cache pages differently from need-to-be-paged-in pages, and what other factors should be examined, are very welcome... 
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753389AbbFOGId (ORCPT ); Mon, 15 Jun 2015 02:08:33 -0400 Received: from mail-wg0-f53.google.com ([74.125.82.53]:34872 "EHLO mail-wg0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751386AbbFOGIY (ORCPT ); Mon, 15 Jun 2015 02:08:24 -0400 MIME-Version: 1.0 X-Originating-IP: [212.25.79.130] In-Reply-To: <557E65E7.9010000@redhat.com> References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> <557E65E7.9010000@redhat.com> From: Leon Romanovsky Date: Mon, 15 Jun 2015 09:08:02 +0300 Message-ID: Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead To: Rik van Riel Cc: Ebru Akagunduz , Linux-MM , Andrew Morton , "kirill.shutemov" , n-horiguchi , aarcange , "iamjoonsoo.kim" , Xiexiuqi , gorcunov , "linux-kernel@vger.kernel.org" , Mel Gorman , rientjes , Vlastimil Babka , "aneesh.kumar" , hughd , Johannes Weiner , mhocko , boaz , raindel Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 15, 2015 at 8:43 AM, Rik van Riel wrote: > On 06/15/2015 01:40 AM, Leon Romanovsky wrote: >> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz >> wrote: >>> This patch makes optimistic check for swapin readahead >>> to increase thp collapse rate. Before getting swapped >>> out pages to memory, checks them and allows up to a >>> certain number. It also prints out using tracepoints >>> amount of unmapped ptes. 
>>> >>> Signed-off-by: Ebru Akagunduz > >>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, >>> { >>> pmd_t *pmd; >>> pte_t *pte, *_pte; >>> - int ret = 0, none_or_zero = 0; >>> + int ret = 0, none_or_zero = 0, unmapped = 0; >>> struct page *page; >>> unsigned long _address; >>> spinlock_t *ptl; >>> - int node = NUMA_NO_NODE; >>> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8; >> Sorry for asking, my knoweldge of THP is very limited, but why did you >> choose this default value? >> From the discussion followed by your patch >> (https://lkml.org/lkml/2015/2/27/432), I got an impression that it is >> not necessary right value. > > I believe that Ebru's main focus for this initial version of > the patch series was to get the _mechanism_ (patch 3) right, > while having a fairly simple policy to drive it. > > Any suggestions on when it is a good idea to bring in pages > from swap, and whether to treat resident-in-swap-cache pages > differently from need-to-be-paged-in pages, and what other > factors should be examined, are very welcome... My concern with these patches is that they deal with a specific load/scenario (most of the application is brought back from swap). In a scenario where only 10% of the data is required, they can theoretically bring in up to 80% of the data (70% waste).
> > -- > All rights reversed -- Leon Romanovsky | Independent Linux Consultant www.leon.nu | leon@leon.nu From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753867AbbFOGfv (ORCPT ); Mon, 15 Jun 2015 02:35:51 -0400 Received: from mx1.redhat.com ([209.132.183.28]:43475 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752759AbbFOGfn (ORCPT ); Mon, 15 Jun 2015 02:35:43 -0400 Message-ID: <557E7235.1090105@redhat.com> Date: Mon, 15 Jun 2015 02:35:33 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Leon Romanovsky CC: Ebru Akagunduz , Linux-MM , Andrew Morton , "kirill.shutemov" , n-horiguchi , aarcange , "iamjoonsoo.kim" , Xiexiuqi , gorcunov , "linux-kernel@vger.kernel.org" , Mel Gorman , rientjes , Vlastimil Babka , "aneesh.kumar" , hughd , Johannes Weiner , mhocko , boaz , raindel Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com> <557E65E7.9010000@redhat.com> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/15/2015 02:08 AM, Leon Romanovsky wrote: > On Mon, Jun 15, 2015 at 8:43 AM, Rik van Riel wrote: >> On 06/15/2015 01:40 AM, Leon Romanovsky wrote: >>> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz >>> wrote: >>>> This patch makes optimistic check for swapin readahead >>>> to increase thp collapse rate. Before getting swapped >>>> out pages to memory, checks them and allows up to a >>>> certain number. It also prints out using tracepoints >>>> amount of unmapped ptes. 
>>>> >>>> Signed-off-by: Ebru Akagunduz >> >>>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, >>>> { >>>> pmd_t *pmd; >>>> pte_t *pte, *_pte; >>>> - int ret = 0, none_or_zero = 0; >>>> + int ret = 0, none_or_zero = 0, unmapped = 0; >>>> struct page *page; >>>> unsigned long _address; >>>> spinlock_t *ptl; >>>> - int node = NUMA_NO_NODE; >>>> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8; >>> Sorry for asking, my knoweldge of THP is very limited, but why did you >>> choose this default value? >>> From the discussion followed by your patch >>> (https://lkml.org/lkml/2015/2/27/432), I got an impression that it is >>> not necessary right value. >> >> I believe that Ebru's main focus for this initial version of >> the patch series was to get the _mechanism_ (patch 3) right, >> while having a fairly simple policy to drive it. >> >> Any suggestions on when it is a good idea to bring in pages >> from swap, and whether to treat resident-in-swap-cache pages >> differently from need-to-be-paged-in pages, and what other >> factors should be examined, are very welcome... > My concern with these patches that they deal with specific > load/scenario (most of the application returned back from swap). In > scenario there only 10% of data will be required, it theoretically can > bring upto 80% data (70% waste). The chosen threshold ensures that the remaining non-resident 4kB pages in a THP are only brought in if 7/8th (or 87.5%) of the pages are already resident. 
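As a rough illustration of that threshold, the swap-pte check added to khugepaged_scan_pmd can be modelled in userspace. This is only a sketch: it assumes x86-64 with 4kB base pages and 2MB huge pages (so HPAGE_PMD_NR is 512, which the thread does not state), swap_budget_ok is a hypothetical name, and the separate pte_none()/zero-page limit is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 2MB huge page / 4kB base page, as on x86-64. */
#define HPAGE_PMD_NR 512
/* Budget from the RFC: at most one eighth of the ptes may be swap ptes. */
#define MAX_PTES_SWAP (HPAGE_PMD_NR / 8)

/*
 * Userspace model of the swap-pte check. is_swap[i] says whether pte i
 * is a swap entry. Returns false as soon as the budget is exceeded,
 * mirroring the `goto out_unmap` path in the patch.
 */
static bool swap_budget_ok(const bool *is_swap, size_t nr_ptes)
{
	int unmapped = 0;

	assert(is_swap != NULL);
	for (size_t i = 0; i < nr_ptes; i++) {
		if (is_swap[i] && ++unmapped > MAX_PTES_SWAP)
			return false;
	}
	return true;
}
```

Under these assumptions the budget is 64 swap ptes, so at most 256kB is swapped in per 2MB range and at least 448 of the 512 ptes must not be swap entries.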
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754255AbbFOOAI (ORCPT ); Mon, 15 Jun 2015 10:00:08 -0400 Received: from mx1.redhat.com ([209.132.183.28]:44138 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754161AbbFON75 (ORCPT ); Mon, 15 Jun 2015 09:59:57 -0400 Message-ID: <557EDA53.3040906@redhat.com> Date: Mon, 15 Jun 2015 09:59:47 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Ebru Akagunduz , linux-mm@kvack.org CC: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com, n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com, vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com> <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> In-Reply-To: <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/14/2015 11:04 AM, Ebru Akagunduz wrote: > This patch makes swapin readahead to improve thp collapse rate. > When khugepaged scanned pages, there can be a few of the pages > in swap area. > > With the patch THP can collapse 4kB pages into a THP when > there are up to max_ptes_swap swap ptes in a 2MB range. > > The patch was tested with a test program that allocates > 800MB of memory, writes to it, and then sleeps. I force > the system to swap out all. 
> Afterwards, the test program
> touches the area by writing, it skips a page in each
> 20 pages of the area.
>
> Without the patch, system did not swap in readahead.
> THP rate was %47 of the program of the memory, it
> did not change over time.
>
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed %99 of the program's memory.
>
> Signed-off-by: Ebru Akagunduz

Mechanism looks good to me.

Acked-by: Rik van Riel

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932159AbbFOOFj (ORCPT ); Mon, 15 Jun 2015 10:05:39 -0400
Received: from mx1.redhat.com ([209.132.183.28]:40060 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932116AbbFOOFa (ORCPT ); Mon, 15 Jun 2015 10:05:30 -0400
Message-ID: <557EDBA2.9090308@redhat.com>
Date: Mon, 15 Jun 2015 10:05:22 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Ebru Akagunduz , linux-mm@kvack.org
CC: akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com,
	xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org,
	mgorman@suse.de, rientjes@google.com, vbabka@suse.cz,
	aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org,
	mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com
Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
In-Reply-To: <1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> This patch makes optimistic check for
> swapin readahead
> to increase thp collapse rate. Before getting swapped
> out pages to memory, checks them and allows up to a
> certain number. It also prints out using tracepoints
> amount of unmapped ptes.
>
> Signed-off-by: Ebru Akagunduz

> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> {
> 	pmd_t *pmd;
> 	pte_t *pte, *_pte;
> -	int ret = 0, none_or_zero = 0;
> +	int ret = 0, none_or_zero = 0, unmapped = 0;
> 	struct page *page;
> 	unsigned long _address;
> 	spinlock_t *ptl;
> -	int node = NUMA_NO_NODE;
> +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> 	bool writable = false, referenced = false;

This has the effect of only swapping in 4kB pages to form a THP
if 7/8th of the THP is already resident in memory.

This is a pretty conservative thing to do.

I am not sure if we would also need to take into account things
like these:
1) How many pages in the THP-area are recently referenced?
   Maybe this does not matter if 87.5% of the 4kB pages got
   faulted in after swap-out, anyway?
2) How much free memory does the system have?
   We don't test that for collapsing a THP with lots of
   pte_none() ptes, so not sure how much this matters...
3) How many of the pages we want to swap in are already resident
   in the swap cache?
   Not sure exactly what to do with this number...
4) other factors?

I am also not sure how we would determine such a policy, except
by maybe having these patches sit in -mm and -next for a few
cycles, and seeing what happens...
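
The optimistic check described in the patch can be modelled in user space as a counting loop. What follows is a minimal sketch, assuming 512 ptes per range and a three-state stand-in for a pte; enum pte_kind and scan_allows_swapin() are hypothetical names, and the real khugepaged_scan_pmd() does considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_NR 512 /* assumed: 2MB huge page / 4kB base page */
#define MAX_PTES_SWAP (HPAGE_PMD_NR / 8)

/* Hypothetical stand-in for the state of one pte. */
enum pte_kind { PTE_PRESENT, PTE_NONE, PTE_SWAP };

/*
 * Walk the ptes of one range and count the swap entries.  Give up as
 * soon as the count exceeds MAX_PTES_SWAP.  *unmapped receives the
 * count reached when the walk stopped.
 */
static bool scan_allows_swapin(const enum pte_kind *ptes, size_t n,
			       int *unmapped)
{
	size_t i;

	assert(ptes != NULL && unmapped != NULL);
	*unmapped = 0;
	for (i = 0; i < n; i++) {
		if (ptes[i] != PTE_SWAP)
			continue;
		if (++(*unmapped) > MAX_PTES_SWAP)
			return false; /* too much would have to be read in */
	}
	return true;
}
```

In this model a range with exactly 64 swap entries still qualifies, while the 65th swap entry ends the scan early.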
--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754818AbbFOQIJ (ORCPT ); Mon, 15 Jun 2015 12:08:09 -0400
Received: from mail-wi0-f171.google.com ([209.85.212.171]:37968 "EHLO mail-wi0-f171.google.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754117AbbFOQIH (ORCPT ); Mon, 15 Jun 2015 12:08:07 -0400
MIME-Version: 1.0
X-Originating-IP: [213.57.247.249]
In-Reply-To: <557EDBA2.9090308@redhat.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-3-git-send-email-ebru.akagunduz@gmail.com>
	<557EDBA2.9090308@redhat.com>
From: Leon Romanovsky
Date: Mon, 15 Jun 2015 19:07:45 +0300
Message-ID:
Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead
To: Rik van Riel
Cc: Ebru Akagunduz , Linux-MM , Andrew Morton , "kirill.shutemov" ,
	n-horiguchi , aarcange , "iamjoonsoo.kim" , Xiexiuqi , gorcunov ,
	"linux-kernel@vger.kernel.org" , Mel Gorman , rientjes ,
	Vlastimil Babka , "aneesh.kumar" , Hugh Dickins , Johannes Weiner ,
	mhocko , boaz , raindel
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 15, 2015 at 5:05 PM, Rik van Riel wrote:
>
> On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> > This patch makes optimistic check for swapin readahead
> > to increase thp collapse rate. Before getting swapped
> > out pages to memory, checks them and allows up to a
> > certain number. It also prints out using tracepoints
> > amount of unmapped ptes.
> >
> > Signed-off-by: Ebru Akagunduz
>
> > @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > {
> > 	pmd_t *pmd;
> > 	pte_t *pte, *_pte;
> > -	int ret = 0, none_or_zero = 0;
> > +	int ret = 0, none_or_zero = 0, unmapped = 0;
> > 	struct page *page;
> > 	unsigned long _address;
> > 	spinlock_t *ptl;
> > -	int node = NUMA_NO_NODE;
> > +	int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> > 	bool writable = false, referenced = false;
>
> This has the effect of only swapping in 4kB pages to form a THP
> if 7/8th of the THP is already resident in memory.

Thanks for clarifying it to me.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757151AbbFPVPn (ORCPT ); Tue, 16 Jun 2015 17:15:43 -0400
Received: from mail.linuxfoundation.org ([140.211.169.12]:36605 "EHLO mail.linuxfoundation.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754792AbbFPVPl (ORCPT ); Tue, 16 Jun 2015 17:15:41 -0400
Date: Tue, 16 Jun 2015 14:15:40 -0700
From: Andrew Morton
To: Ebru Akagunduz
Cc: linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, riel@redhat.com,
	iamjoonsoo.kim@lge.com, xiexiuqi@huawei.com, gorcunov@openvz.org,
	linux-kernel@vger.kernel.org, mgorman@suse.de, rientjes@google.com,
	vbabka@suse.cz, aneesh.kumar@linux.vnet.ibm.com, hughd@google.com,
	hannes@cmpxchg.org, mhocko@suse.cz, boaz@plexistor.com,
	raindel@mellanox.com
Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate
Message-Id: <20150616141540.adc40130139151bf19f07ff9@linux-foundation.org>
In-Reply-To: <1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com>
X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz wrote:

> This patch makes swapin readahead to improve thp collapse rate.
> When khugepaged scanned pages, there can be a few of the pages
> in swap area.
>
> With the patch THP can collapse 4kB pages into a THP when
> there are up to max_ptes_swap swap ptes in a 2MB range.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all. Afterwards, the test program
> touches the area by writing, it skips a page in each
> 20 pages of the area.
>
> Without the patch, system did not swap in readahead.
> THP rate was %47 of the program of the memory, it
> did not change over time.
>
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed %99 of the program's memory.
>
> ...
>
> +/*
> + * Bring missing pages in from swap, to complete THP collapse.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + *
> + * Called and returns without pte mapped or spinlocks held,
> + * but with mmap_sem held to protect against vma changes.
> + */
> +
> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> +		struct vm_area_struct *vma,
> +		unsigned long address, pmd_t *pmd,
> +		pte_t *pte)
> +{
> +	unsigned long _address;
> +	pte_t pteval = *pte;
> +	int swap_pte = 0;
> +
> +	pte = pte_offset_map(pmd, address);
> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> +	     pte++, _address += PAGE_SIZE) {
> +		pteval = *pte;
> +		if (is_swap_pte(pteval)) {
> +			swap_pte++;
> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
> +			/* pte is unmapped now, we need to map it */
> +			pte = pte_offset_map(pmd, _address);
> +		}
> +	}
> +	pte--;
> +	pte_unmap(pte);
> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> +}

This is doing a series of synchronous reads. That will be sloooow on
spinning disks.

This function should be significantly faster if it first gets all the
necessary I/O underway. I don't think we have a function which exactly
does this. Perhaps generalise swapin_readahead() or open-code
something like

	blk_start_plug(...);
	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
	     pte++, _address += PAGE_SIZE) {
		if (is_swap_pte(*pte)) {
			read_swap_cache_async(...);
		}
	}
	blk_finish_plug(...);

If you do make a change such as this, please benchmark its effects.
Not on SSD ;)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757570AbbFQDUi (ORCPT ); Tue, 16 Jun 2015 23:20:38 -0400
Received: from mx1.redhat.com ([209.132.183.28]:51134 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752723AbbFQDUa (ORCPT ); Tue, 16 Jun 2015 23:20:30 -0400
Message-ID: <5580E774.3070307@redhat.com>
Date: Tue, 16 Jun 2015 23:20:20 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Andrew Morton , Ebru Akagunduz
CC: linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com,
	xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org,
	mgorman@suse.de, rientjes@google.com, vbabka@suse.cz,
	aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org,
	mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com
Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com>
	<20150616141540.adc40130139151bf19f07ff9@linux-foundation.org>
In-Reply-To: <20150616141540.adc40130139151bf19f07ff9@linux-foundation.org>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/16/2015 05:15 PM, Andrew Morton wrote:
> On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz wrote:
>
>> This patch makes swapin readahead to improve thp collapse rate.
>> When khugepaged scanned pages, there can be a few of the pages
>> in swap area.
>>
>> With the patch THP can collapse 4kB pages into a THP when
>> there are up to max_ptes_swap swap ptes in a 2MB range.
>>
>> The patch was tested with a test program that allocates
>> 800MB of memory, writes to it, and then sleeps. I force
>> the system to swap out all. Afterwards, the test program
>> touches the area by writing, it skips a page in each
>> 20 pages of the area.
>>
>> Without the patch, system did not swap in readahead.
>> THP rate was %47 of the program of the memory, it
>> did not change over time.
>>
>> With this patch, after 10 minutes of waiting khugepaged had
>> collapsed %99 of the program's memory.
>>
>> ...
>>
>> +/*
>> + * Bring missing pages in from swap, to complete THP collapse.
>> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
>> + *
>> + * Called and returns without pte mapped or spinlocks held,
>> + * but with mmap_sem held to protect against vma changes.
>> + */
>> +
>> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
>> +		struct vm_area_struct *vma,
>> +		unsigned long address, pmd_t *pmd,
>> +		pte_t *pte)
>> +{
>> +	unsigned long _address;
>> +	pte_t pteval = *pte;
>> +	int swap_pte = 0;
>> +
>> +	pte = pte_offset_map(pmd, address);
>> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
>> +	     pte++, _address += PAGE_SIZE) {
>> +		pteval = *pte;
>> +		if (is_swap_pte(pteval)) {
>> +			swap_pte++;
>> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
>> +			/* pte is unmapped now, we need to map it */
>> +			pte = pte_offset_map(pmd, _address);
>> +		}
>> +	}
>> +	pte--;
>> +	pte_unmap(pte);
>> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
>> +}
>
> This is doing a series of synchronous reads. That will be sloooow on
> spinning disks.
>
> This function should be significantly faster if it first gets all the
> necessary I/O underway. I don't think we have a function which exactly
> does this. Perhaps generalise swapin_readahead() or open-code
> something like

Looking at do_swap_page() and __lock_page_or_retry(), I guess
there already is a way to do the above.

Passing a "flags" of FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT
to do_swap_page() should result in do_swap_page() returning with
the pte unmapped and the mmap_sem still held if the page was not
immediately available to map into the pte (trylock_page succeeds).

Ebru, can you try passing the above as the flags argument to
do_swap_page(), and see what happens?

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757070AbbFQRjL (ORCPT ); Wed, 17 Jun 2015 13:39:11 -0400
Received: from mail-wi0-f180.google.com ([209.85.212.180]:38330 "EHLO mail-wi0-f180.google.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753449AbbFQRjB (ORCPT ); Wed, 17 Jun 2015 13:39:01 -0400
Date: Wed, 17 Jun 2015 20:38:56 +0300
From: Ebru Akagunduz
To: Rik van Riel
Cc: linux-mm@kvack.org, kirill.shutemov@linux.intel.com,
	n-horiguchi@ah.jp.nec.com, aarcange@redhat.com, iamjoonsoo.kim@lge.com,
	xiexiuqi@huawei.com, gorcunov@openvz.org, linux-kernel@vger.kernel.org,
	mgorman@suse.de, rientjes@google.com, vbabka@suse.cz,
	aneesh.kumar@linux.vnet.ibm.com, hughd@google.com, hannes@cmpxchg.org,
	mhocko@suse.cz, boaz@plexistor.com, raindel@mellanox.com
Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate
Message-ID: <20150617173856.GA3970@debian>
References: <1434294283-8699-1-git-send-email-ebru.akagunduz@gmail.com>
	<1434294283-8699-4-git-send-email-ebru.akagunduz@gmail.com>
	<20150616141540.adc40130139151bf19f07ff9@linux-foundation.org>
	<5580E774.3070307@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5580E774.3070307@redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 16, 2015 at 11:20:20PM -0400, Rik van Riel wrote:
> On 06/16/2015 05:15 PM, Andrew Morton wrote:
> > On Sun, 14 Jun 2015 18:04:43
> > +0300 Ebru Akagunduz wrote:
> >
> >> This patch makes swapin readahead to improve thp collapse rate.
> >> When khugepaged scanned pages, there can be a few of the pages
> >> in swap area.
> >>
> >> With the patch THP can collapse 4kB pages into a THP when
> >> there are up to max_ptes_swap swap ptes in a 2MB range.
> >>
> >> The patch was tested with a test program that allocates
> >> 800MB of memory, writes to it, and then sleeps. I force
> >> the system to swap out all. Afterwards, the test program
> >> touches the area by writing, it skips a page in each
> >> 20 pages of the area.
> >>
> >> Without the patch, system did not swap in readahead.
> >> THP rate was %47 of the program of the memory, it
> >> did not change over time.
> >>
> >> With this patch, after 10 minutes of waiting khugepaged had
> >> collapsed %99 of the program's memory.
> >>
> >> ...
> >>
> >> +/*
> >> + * Bring missing pages in from swap, to complete THP collapse.
> >> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> >> + *
> >> + * Called and returns without pte mapped or spinlocks held,
> >> + * but with mmap_sem held to protect against vma changes.
> >> + */
> >> +
> >> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> >> +		struct vm_area_struct *vma,
> >> +		unsigned long address, pmd_t *pmd,
> >> +		pte_t *pte)
> >> +{
> >> +	unsigned long _address;
> >> +	pte_t pteval = *pte;
> >> +	int swap_pte = 0;
> >> +
> >> +	pte = pte_offset_map(pmd, address);
> >> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> >> +	     pte++, _address += PAGE_SIZE) {
> >> +		pteval = *pte;
> >> +		if (is_swap_pte(pteval)) {
> >> +			swap_pte++;
> >> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
> >> +			/* pte is unmapped now, we need to map it */
> >> +			pte = pte_offset_map(pmd, _address);
> >> +		}
> >> +	}
> >> +	pte--;
> >> +	pte_unmap(pte);
> >> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> >> +}
> >
> > This is doing a series of synchronous reads. That will be sloooow on
> > spinning disks.
> >
> > This function should be significantly faster if it first gets all the
> > necessary I/O underway. I don't think we have a function which exactly
> > does this. Perhaps generalise swapin_readahead() or open-code
> > something like
>
> Looking at do_swap_page() and __lock_page_or_retry(), I guess
> there already is a way to do the above.
>
> Passing a "flags" of FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT
> to do_swap_page() should result in do_swap_page() returning with
> the pte unmapped and the mmap_sem still held if the page was not
> immediately available to map into the pte (trylock_page succeeds).
>
> Ebru, can you try passing the above as the flags argument to
> do_swap_page(), and see what happens?

I will try it and resend the patch series.
Thanks for the suggestions.

Ebru
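
For readers who want to experiment with the "get all the necessary I/O underway first" idea outside the kernel, a loose user-space analogy is to hint every range with posix_fadvise(POSIX_FADV_WILLNEED) before reading any of them. This is only a sketch of the two-pass pattern on a regular file. It is not the swap-in path discussed in this thread, and CHUNK_SIZE and prefetch_then_read() are names invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE 4096 /* stand-in for PAGE_SIZE; an assumption */

/*
 * Pass 1: hint every wanted chunk to the kernel so the reads can be
 * queued together.  Pass 2: copy the chunks out; by then some of the
 * data may already be in the page cache.
 * Returns the number of chunks read completely, or -1 on a read error.
 */
static int prefetch_then_read(int fd, const off_t *offsets, size_t n,
			      char *out)
{
	size_t i;
	int done = 0;

	assert(offsets != NULL && out != NULL);

	for (i = 0; i < n; i++) {
		/* Advisory only: a failed hint is not treated as fatal. */
		(void)posix_fadvise(fd, offsets[i], CHUNK_SIZE,
				    POSIX_FADV_WILLNEED);
	}

	for (i = 0; i < n; i++) {
		ssize_t got = pread(fd, out + i * CHUNK_SIZE, CHUNK_SIZE,
				    offsets[i]);
		if (got < 0)
			return -1;
		if (got == CHUNK_SIZE)
			done++;
	}
	return done;
}
```

Whether the hinted variant is actually faster depends on the storage device and on what is already cached, so it would need to be measured against a plain read loop on a spinning disk, as requested above for the kernel change.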