* [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Ebru Akagunduz
Date: 2015-09-14 19:31 UTC
To: linux-mm
Cc: akpm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel, Ebru Akagunduz

This patch series makes swapin readahead up to a certain number to gain more THP performance, and adds tracepoints for khugepaged_scan_pmd, collapse_huge_page and __collapse_huge_page_isolate.

This patch series was written to deal with programs that access most, but not all, of their memory after they get swapped out. Currently these programs do not get their memory collapsed into THPs after the system has swapped their memory out, while they would get THPs before the swapping happened.

This patch series was tested with a test program that allocates 400MB of memory, writes to it, and then sleeps. I force the system to swap it all out. Afterwards, the test program touches the area by writing to it, leaving a piece of it untouched. This shows how much swapin readahead is done by the patch.

Test results:

                 After swapped out
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction |
-------------------------------------------------------------------
With patch    | 90076 kB  | 88064 kB      | 309928 kB | 99%      |
-------------------------------------------------------------------
Without patch | 194068 kB | 192512 kB     | 205936 kB | 99%      |
-------------------------------------------------------------------

                 After swapped in
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction |
-------------------------------------------------------------------
With patch    | 201408 kB | 198656 kB     | 198596 kB | 98%      |
-------------------------------------------------------------------
Without patch | 292624 kB | 192512 kB     | 107380 kB | 65%      |
-------------------------------------------------------------------

Ebru Akagunduz (3):
  mm: add tracepoint for scanning pages
  mm: make optimistic check for swapin readahead
  mm: make swapin readahead to improve thp collapse rate

 include/trace/events/huge_memory.h | 165 ++++++++++++++++++++++++
 mm/huge_memory.c                   | 248 ++++++++++++++++++++++++++++++++-----
 mm/internal.h                      |   4 +
 mm/memory.c                        |   2 +-
 4 files changed, 386 insertions(+), 33 deletions(-)
 create mode 100644 include/trace/events/huge_memory.h

-- 
1.9.1
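For reference, a minimal sketch of the kind of test program described above might look like this. This is not the author's actual test: the 400MB size and the skip-one-page-in-twenty pattern are taken from the descriptions in this thread, everything else is an assumption.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SIZE (400UL << 20)	/* 400MB, as in the description above */
#define PAGE 4096UL

int main(void)
{
	char *area = malloc(SIZE);
	unsigned long off;

	if (!area)
		return 1;
	memset(area, 1, SIZE);	/* fault in and dirty every page */
	sleep(600);		/* idle while the tester forces swapout */
	/* touch the area again, skipping one page in each twenty */
	for (off = 0; off < SIZE; off += PAGE)
		if ((off / PAGE) % 20)
			area[off] = 2;
	pause();		/* hold the mapping while THP usage is measured */
	return 0;
}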
* [RFC v5 1/3] mm: add tracepoint for scanning pages
From: Ebru Akagunduz
Date: 2015-09-14 19:31 UTC
To: linux-mm
Cc: akpm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel, Ebru Akagunduz

Static tracepoints record data from functions. This makes it possible to automate debugging without making a lot of changes in the source code. This patch adds tracepoints for khugepaged_scan_pmd, collapse_huge_page and __collapse_huge_page_isolate.

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
Changes in v2:
 - Nothing changed

Changes in v3:
 - Print the page address instead of vm_start (Vlastimil Babka)
 - Define constants to specify the exact tracepoint result (Vlastimil Babka)

Changes in v4:
 - Change the constant prefix to SCAN_ instead of MM_ (Vlastimil Babka)
 - Move the constants into the enum (Vlastimil Babka)
 - Move the constants from mm.h to huge_memory.c (because they will only be used in huge_memory.c) (Vlastimil Babka)
 - Print the pfn in tracepoints (Vlastimil Babka)
 - Print the scan result as a string in the tracepoint (Vlastimil Babka) (I tried to do the same thing mm/compaction.c does to print strings. My patch did not print the string; I was missing something but could not see why)
 - Do not change function return values for success and failure; leave them as they were, in agreement with Documentation/CodingStyle (Vlastimil Babka)
 - Define scan_result to specify the tracepoint result (Ebru Akagunduz)
 - Add an out_nolock label to avoid multiple tracepoints (Vlastimil Babka)

Changes in v5:
 - Use tracepoint macros to print the string in userspace (fixes the string printing problem) (Vlastimil Babka)

 include/trace/events/huge_memory.h | 137 +++++++++++++++++++++++++++++++
 mm/huge_memory.c                   | 164 ++++++++++++++++++++++++++++++-------
 2 files changed, 270 insertions(+), 31 deletions(-)
 create mode 100644 include/trace/events/huge_memory.h

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
new file mode 100644
index 0000000..1df9bf5
--- /dev/null
+++ b/include/trace/events/huge_memory.h
@@ -0,0 +1,137 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM huge_memory
+
+#if !defined(__HUGE_MEMORY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HUGE_MEMORY_H
+
+#include <linux/tracepoint.h>
+
+#include <trace/events/gfpflags.h>
+
+#define SCAN_STATUS \
+	EM( SCAN_FAIL, "failed") \
+	EM( SCAN_SUCCEED, "succeeded") \
+	EM( SCAN_PMD_NULL, "pmd_null") \
+	EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \
+	EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
+	EM( SCAN_PAGE_RO, "no_writable_page") \
+	EM( SCAN_NO_REFERENCED_PAGE, "no_referenced_page") \
+	EM( SCAN_PAGE_NULL, "page_null") \
+	EM( SCAN_SCAN_ABORT, "scan_aborted") \
+	EM( SCAN_PAGE_COUNT, "not_suitable_page_count") \
+	EM( SCAN_PAGE_LRU, "page_not_in_lru") \
+	EM( SCAN_PAGE_LOCK, "page_locked") \
+	EM( SCAN_PAGE_ANON, "page_not_anon") \
+	EM( SCAN_ANY_PROCESS, "no_process_for_page") \
+	EM( SCAN_VMA_NULL, "vma_null") \
+	EM( SCAN_VMA_CHECK, "vma_check_failed") \
+	EM( SCAN_ADDRESS_RANGE, "not_suitable_address_range") \
+	EM( SCAN_SWAP_CACHE_PAGE,
"page_swap_cache") \ + EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\ + EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \ + EMe( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") + +#undef EM +#undef EMe +#define EM(a, b) TRACE_DEFINE_ENUM(a); +#define EMe(a, b) TRACE_DEFINE_ENUM(a); + +SCAN_STATUS + +#undef EM +#undef EMe +#define EM(a, b) {a, b}, +#define EMe(a, b) {a, b} + +TRACE_EVENT(mm_khugepaged_scan_pmd, + + TP_PROTO(struct mm_struct *mm, unsigned long pfn, bool writable, + bool referenced, int none_or_zero, int status), + + TP_ARGS(mm, pfn, writable, referenced, none_or_zero, status), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(unsigned long, pfn) + __field(bool, writable) + __field(bool, referenced) + __field(int, none_or_zero) + __field(int, status) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->pfn = pfn; + __entry->writable = writable; + __entry->referenced = referenced; + __entry->none_or_zero = none_or_zero; + __entry->status = status; + ), + + TP_printk("mm=%p, scan_pfn=0x%lx, writable=%d, referenced=%d, none_or_zero=%d, status=%s", + __entry->mm, + __entry->pfn, + __entry->writable, + __entry->referenced, + __entry->none_or_zero, + __print_symbolic(__entry->status, SCAN_STATUS)) +); + +TRACE_EVENT(mm_collapse_huge_page, + + TP_PROTO(struct mm_struct *mm, int isolated, int status), + + TP_ARGS(mm, isolated, status), + + TP_STRUCT__entry( + __field(struct mm_struct *, mm) + __field(int, isolated) + __field(int, status) + ), + + TP_fast_assign( + __entry->mm = mm; + __entry->isolated = isolated; + __entry->status = status; + ), + + TP_printk("mm=%p, isolated=%d, status=%s", + __entry->mm, + __entry->isolated, + __print_symbolic(__entry->status, SCAN_STATUS)) +); + +TRACE_EVENT(mm_collapse_huge_page_isolate, + + TP_PROTO(unsigned long pfn, int none_or_zero, + bool referenced, bool writable, int status), + + TP_ARGS(pfn, none_or_zero, referenced, writable, status), + + TP_STRUCT__entry( + __field(unsigned long, pfn) + __field(int, none_or_zero) + __field(bool, referenced) + __field(bool, writable) + __field(int, status) + ), + + TP_fast_assign( + __entry->pfn = pfn; + __entry->none_or_zero = none_or_zero; + __entry->referenced = referenced; + __entry->writable = writable; + __entry->status = status; + ), + + TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s", + __entry->pfn, + __entry->none_or_zero, + __entry->referenced, + __entry->writable, + __print_symbolic(__entry->status, SCAN_STATUS)) +); + +#endif /* __HUGE_MEMORY_H */ +#include <trace/define_trace.h> + diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4b06b8d..4215cee 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -31,6 +31,33 @@ #include <asm/pgalloc.h> #include "internal.h" +enum scan_result { + SCAN_FAIL, + SCAN_SUCCEED, + SCAN_PMD_NULL, + SCAN_EXCEED_NONE_PTE, + SCAN_PTE_NON_PRESENT, + SCAN_PAGE_RO, + SCAN_NO_REFERENCED_PAGE, + SCAN_PAGE_NULL, + SCAN_SCAN_ABORT, + SCAN_PAGE_COUNT, + SCAN_PAGE_LRU, + SCAN_PAGE_LOCK, + SCAN_PAGE_ANON, + SCAN_ANY_PROCESS, + SCAN_VMA_NULL, + SCAN_VMA_CHECK, + SCAN_ADDRESS_RANGE, + SCAN_SWAP_CACHE_PAGE, + SCAN_DEL_PAGE_LRU, + SCAN_ALLOC_HUGE_PAGE_FAIL, + SCAN_CGROUP_CHARGE_FAIL +}; + +#define CREATE_TRACE_POINTS +#include <trace/events/huge_memory.h> + /* * By default transparent hugepage support is disabled in order that avoid * to risk increase the memory footprint of applications without a guaranteed @@ -2199,25 +2226,31 @@ static int __collapse_huge_page_isolate(struct vm_area_struct 
*vma, unsigned long address, pte_t *pte) { - struct page *page; + struct page *page = NULL; pte_t *_pte; - int none_or_zero = 0; + int none_or_zero = 0, result = 0; bool referenced = false, writable = false; for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, address += PAGE_SIZE) { pte_t pteval = *_pte; if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { if (!userfaultfd_armed(vma) && - ++none_or_zero <= khugepaged_max_ptes_none) + ++none_or_zero <= khugepaged_max_ptes_none) { continue; - else + } else { + result = SCAN_EXCEED_NONE_PTE; goto out; + } } - if (!pte_present(pteval)) + if (!pte_present(pteval)) { + result = SCAN_PTE_NON_PRESENT; goto out; + } page = vm_normal_page(vma, address, pteval); - if (unlikely(!page)) + if (unlikely(!page)) { + result = SCAN_PAGE_NULL; goto out; + } VM_BUG_ON_PAGE(PageCompound(page), page); VM_BUG_ON_PAGE(!PageAnon(page), page); @@ -2229,8 +2262,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, * is needed to serialize against split_huge_page * when invoked from the VM. */ - if (!trylock_page(page)) + if (!trylock_page(page)) { + result = SCAN_PAGE_LOCK; goto out; + } /* * cannot use mapcount: can't collapse if there's a gup pin. @@ -2239,6 +2274,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, */ if (page_count(page) != 1 + !!PageSwapCache(page)) { unlock_page(page); + result = SCAN_PAGE_COUNT; goto out; } if (pte_write(pteval)) { @@ -2246,6 +2282,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, } else { if (PageSwapCache(page) && !reuse_swap_page(page)) { unlock_page(page); + result = SCAN_SWAP_CACHE_PAGE; goto out; } /* @@ -2260,6 +2297,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, */ if (isolate_lru_page(page)) { unlock_page(page); + result = SCAN_DEL_PAGE_LRU; goto out; } /* 0 stands for page_is_file_cache(page) == false */ @@ -2273,10 +2311,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, mmu_notifier_test_young(vma->vm_mm, address)) referenced = true; } - if (likely(referenced && writable)) - return 1; + if (likely(writable)) { + if (likely(referenced)) { + result = SCAN_SUCCEED; + trace_mm_collapse_huge_page_isolate(page_to_pfn(page), none_or_zero, + referenced, writable, result); + return 1; + } + } else { + result = SCAN_PAGE_RO; + } + out: release_pte_pages(pte, _pte); + trace_mm_collapse_huge_page_isolate(page_to_pfn(page), none_or_zero, + referenced, writable, result); return 0; } @@ -2515,7 +2564,7 @@ static void collapse_huge_page(struct mm_struct *mm, pgtable_t pgtable; struct page *new_page; spinlock_t *pmd_ptl, *pte_ptl; - int isolated; + int isolated, result = 0; unsigned long hstart, hend; struct mem_cgroup *memcg; unsigned long mmun_start; /* For mmu_notifiers */ @@ -2530,12 +2579,16 @@ static void collapse_huge_page(struct mm_struct *mm, /* release the mmap_sem read lock. */ new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node); - if (!new_page) - return; + if (!new_page) { + result = SCAN_ALLOC_HUGE_PAGE_FAIL; + goto out_nolock; + } if (unlikely(mem_cgroup_try_charge(new_page, mm, - gfp, &memcg))) - return; + gfp, &memcg))) { + result = SCAN_CGROUP_CHARGE_FAIL; + goto out_nolock; + } /* * Prevent all access to pagetables with the exception of @@ -2543,21 +2596,31 @@ static void collapse_huge_page(struct mm_struct *mm, * handled by the anon_vma lock + PG_lock. 
*/ down_write(&mm->mmap_sem); - if (unlikely(khugepaged_test_exit(mm))) + if (unlikely(khugepaged_test_exit(mm))) { + result = SCAN_ANY_PROCESS; goto out; + } vma = find_vma(mm, address); - if (!vma) + if (!vma) { + result = SCAN_VMA_NULL; goto out; + } hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; hend = vma->vm_end & HPAGE_PMD_MASK; - if (address < hstart || address + HPAGE_PMD_SIZE > hend) + if (address < hstart || address + HPAGE_PMD_SIZE > hend) { + result = SCAN_ADDRESS_RANGE; goto out; - if (!hugepage_vma_check(vma)) + } + if (!hugepage_vma_check(vma)) { + result = SCAN_VMA_CHECK; goto out; + } pmd = mm_find_pmd(mm, address); - if (!pmd) + if (!pmd) { + result = SCAN_PMD_NULL; goto out; + } anon_vma_lock_write(vma->anon_vma); @@ -2594,6 +2657,7 @@ static void collapse_huge_page(struct mm_struct *mm, pmd_populate(mm, pmd, pmd_pgtable(_pmd)); spin_unlock(pmd_ptl); anon_vma_unlock_write(vma->anon_vma); + result = SCAN_FAIL; goto out; } @@ -2631,10 +2695,15 @@ static void collapse_huge_page(struct mm_struct *mm, *hpage = NULL; khugepaged_pages_collapsed++; + result = SCAN_SUCCEED; out_up_write: up_write(&mm->mmap_sem); + trace_mm_collapse_huge_page(mm, isolated, result); return; +out_nolock: + trace_mm_collapse_huge_page(mm, isolated, result); + return; out: mem_cgroup_cancel_charge(new_page, memcg); goto out_up_write; @@ -2647,8 +2716,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, { pmd_t *pmd; pte_t *pte, *_pte; - int ret = 0, none_or_zero = 0; - struct page *page; + int ret = 0, none_or_zero = 0, result = 0; + struct page *page = NULL; unsigned long _address; spinlock_t *ptl; int node = NUMA_NO_NODE; @@ -2657,8 +2726,10 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, VM_BUG_ON(address & ~HPAGE_PMD_MASK); pmd = mm_find_pmd(mm, address); - if (!pmd) + if (!pmd) { + result = SCAN_PMD_NULL; goto out; + } memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); pte = pte_offset_map_lock(mm, pmd, address, &ptl); @@ -2667,19 +2738,25 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, pte_t pteval = *_pte; if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { if (!userfaultfd_armed(vma) && - ++none_or_zero <= khugepaged_max_ptes_none) + ++none_or_zero <= khugepaged_max_ptes_none) { continue; - else + } else { + result = SCAN_EXCEED_NONE_PTE; goto out_unmap; + } } - if (!pte_present(pteval)) + if (!pte_present(pteval)) { + result = SCAN_PTE_NON_PRESENT; goto out_unmap; + } if (pte_write(pteval)) writable = true; page = vm_normal_page(vma, _address, pteval); - if (unlikely(!page)) + if (unlikely(!page)) { + result = SCAN_PAGE_NULL; goto out_unmap; + } /* * Record which node the original page is from and save this * information to khugepaged_node_load[]. @@ -2687,26 +2764,49 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, * hit record. */ node = page_to_nid(page); - if (khugepaged_scan_abort(node)) + if (khugepaged_scan_abort(node)) { + result = SCAN_SCAN_ABORT; goto out_unmap; + } khugepaged_node_load[node]++; VM_BUG_ON_PAGE(PageCompound(page), page); - if (!PageLRU(page) || PageLocked(page) || !PageAnon(page)) + if (!PageLRU(page)) { + result = SCAN_SCAN_ABORT; + goto out_unmap; + } + if (PageLocked(page)) { + result = SCAN_PAGE_LOCK; goto out_unmap; + } + if (!PageAnon(page)) { + result = SCAN_PAGE_ANON; + goto out_unmap; + } + /* * cannot use mapcount: can't collapse if there's a gup pin. * The page must only be referenced by the scanned process * and page swap cache. 
 		 */
-		if (page_count(page) != 1 + !!PageSwapCache(page))
+		if (page_count(page) != 1 + !!PageSwapCache(page)) {
+			result = SCAN_PAGE_COUNT;
 			goto out_unmap;
+		}
 		if (pte_young(pteval) || page_is_young(page) ||
 		    PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = true;
 	}
-	if (referenced && writable)
-		ret = 1;
+	if (writable) {
+		if (referenced) {
+			result = SCAN_SUCCEED;
+			ret = 1;
+		} else {
+			result = SCAN_NO_REFERENCED_PAGE;
+		}
+	} else {
+		result = SCAN_PAGE_RO;
+	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
@@ -2715,6 +2815,8 @@ out_unmap:
 		collapse_huge_page(mm, address, hpage, vma, node);
 	}
 out:
+	trace_mm_khugepaged_scan_pmd(mm, page_to_pfn(page), writable, referenced,
+				     none_or_zero, result);
 	return ret;
 }
-- 
1.9.1
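A side note on the EM()/EMe() construction in the header added above, since it recurs in the following patches: it is the classic X-macro pattern. The SCAN_STATUS list is expanded twice with different macro definitions, once so that TRACE_DEFINE_ENUM() exports each enum value to userspace tracing tools, and once as {value, "string"} pairs for __print_symbolic() to translate the status into a readable name. A stripped-down illustration of the same pattern (generic C for illustration, not kernel code):

#define COLOR_LIST \
	EM(COLOR_RED, "red") \
	EMe(COLOR_BLUE, "blue")

/* first expansion: declare the enum values themselves */
#define EM(a, b)	a,
#define EMe(a, b)	a
enum color { COLOR_LIST };
#undef EM
#undef EMe

/* second expansion: build a value -> name table from the same list */
#define EM(a, b)	{ a, b },
#define EMe(a, b)	{ a, b }
static const struct {
	int val;
	const char *name;
} color_names[] = { COLOR_LIST };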
* [RFC v5 2/3] mm: make optimistic check for swapin readahead
From: Ebru Akagunduz
Date: 2015-09-14 19:31 UTC
To: linux-mm
Cc: akpm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel, Ebru Akagunduz

This patch introduces a new sysfs integer knob
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
which makes an optimistic check for swapin readahead to
increase the THP collapse rate. Before bringing the swapped-out
pages back to memory, khugepaged checks them and allows up to a
certain number. It also prints out the amount of unmapped ptes
via the tracepoints.

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
---
Changes in v2:
 - Nothing changed

Changes in v3:
 - Define a constant for the exact tracepoint result (Vlastimil Babka)

Changes in v4:
 - Add the sysfs knob, as requested (Kirill A. Shutemov)

Changes in v5:
 - Rename MM_EXCEED_SWAP_PTE to SCAN_EXCEED_SWAP_PTE (Vlastimil Babka)
 - Add tracepoint macros to print the string (Vlastimil Babka)

 include/trace/events/huge_memory.h | 14 +++++++-----
 mm/huge_memory.c                   | 45 +++++++++++++++++++++++++++++++++++---
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 1df9bf5..153274c 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -29,7 +29,8 @@
 	EM( SCAN_SWAP_CACHE_PAGE, "page_swap_cache") \
 	EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\
 	EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
-	EMe( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed")
+	EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
+	EMe( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte")
 
 #undef EM
 #undef EMe
@@ -46,9 +47,9 @@ SCAN_STATUS
 TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 	TP_PROTO(struct mm_struct *mm, unsigned long pfn, bool writable,
-		 bool referenced, int none_or_zero, int status),
+		 bool referenced, int none_or_zero, int status, int unmapped),
 
-	TP_ARGS(mm, pfn, writable, referenced, none_or_zero, status),
+	TP_ARGS(mm, pfn, writable, referenced, none_or_zero, status, unmapped),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
@@ -57,6 +58,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 		__field(bool, referenced)
 		__field(int, none_or_zero)
 		__field(int, status)
+		__field(int, unmapped)
 	),
 
 	TP_fast_assign(
@@ -66,15 +68,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 		__entry->referenced = referenced;
 		__entry->none_or_zero = none_or_zero;
 		__entry->status = status;
+		__entry->unmapped = unmapped;
 	),
 
-	TP_printk("mm=%p, scan_pfn=0x%lx, writable=%d, referenced=%d, none_or_zero=%d, status=%s",
+	TP_printk("mm=%p, scan_pfn=0x%lx, writable=%d, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
 		__entry->mm,
 		__entry->pfn,
 		__entry->writable,
 		__entry->referenced,
 		__entry->none_or_zero,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->unmapped)
 );
TRACE_EVENT(mm_collapse_huge_page, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4215cee..049b0db 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -26,6 +26,7 @@ #include <linux/hashtable.h> #include <linux/userfaultfd_k.h> #include <linux/page_idle.h> +#include <linux/swapops.h> #include <asm/tlb.h> #include <asm/pgalloc.h> @@ -52,7 +53,8 @@ enum scan_result { SCAN_SWAP_CACHE_PAGE, SCAN_DEL_PAGE_LRU, SCAN_ALLOC_HUGE_PAGE_FAIL, - SCAN_CGROUP_CHARGE_FAIL + SCAN_CGROUP_CHARGE_FAIL, + SCAN_EXCEED_SWAP_PTE }; #define CREATE_TRACE_POINTS @@ -94,6 +96,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait); * fault. */ static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1; +static unsigned int khugepaged_max_ptes_swap __read_mostly = HPAGE_PMD_NR/8; static int khugepaged(void *none); static int khugepaged_slab_init(void); @@ -580,6 +583,33 @@ static struct kobj_attribute khugepaged_max_ptes_none_attr = __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show, khugepaged_max_ptes_none_store); +static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return sprintf(buf, "%u\n", khugepaged_max_ptes_swap); +} + +static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned long max_ptes_swap; + + err = kstrtoul(buf, 10, &max_ptes_swap); + if (err || max_ptes_swap > HPAGE_PMD_NR-1) + return -EINVAL; + + khugepaged_max_ptes_swap = max_ptes_swap; + + return count; +} + +static struct kobj_attribute khugepaged_max_ptes_swap_attr = + __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show, + khugepaged_max_ptes_swap_store); + static struct attribute *khugepaged_attr[] = { &khugepaged_defrag_attr.attr, &khugepaged_max_ptes_none_attr.attr, @@ -588,6 +618,7 @@ static struct attribute *khugepaged_attr[] = { &full_scans_attr.attr, &scan_sleep_millisecs_attr.attr, &alloc_sleep_millisecs_attr.attr, + &khugepaged_max_ptes_swap_attr.attr, NULL, }; @@ -2720,7 +2751,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct page *page = NULL; unsigned long _address; spinlock_t *ptl; - int node = NUMA_NO_NODE; + int node = NUMA_NO_NODE, unmapped = 0; bool writable = false, referenced = false; VM_BUG_ON(address & ~HPAGE_PMD_MASK); @@ -2736,6 +2767,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { pte_t pteval = *_pte; + if (is_swap_pte(pteval)) { + if (++unmapped <= khugepaged_max_ptes_swap) { + continue; + } else { + ret = SCAN_EXCEED_SWAP_PTE; + goto out_unmap; + } + } if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { if (!userfaultfd_armed(vma) && ++none_or_zero <= khugepaged_max_ptes_none) { @@ -2816,7 +2855,7 @@ out_unmap: } out: trace_mm_khugepaged_scan_pmd(mm, page_to_pfn(page), writable, referenced, - none_or_zero, result); + none_or_zero, result, unmapped); return ret; } -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 17+ messages in thread
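For scale: with 2MB huge pages and 4kB base pages, HPAGE_PMD_NR is 512, so the default of HPAGE_PMD_NR/8 above works out to max_ptes_swap = 64. In other words, khugepaged will tolerate up to 64 of the 512 ptes in a candidate 2MB range being swapped out before it gives up on collapsing that range.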
* Re: [RFC v5 2/3] mm: make optimistic check for swapin readahead
From: Rik van Riel
Date: 2015-09-14 19:47 UTC
To: Ebru Akagunduz, linux-mm
Cc: akpm, kirill.shutemov, n-horiguchi, aarcange, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On 09/14/2015 03:31 PM, Ebru Akagunduz wrote:
> This patch introduces a new sysfs integer knob
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
> which makes an optimistic check for swapin readahead to
> increase the THP collapse rate. Before bringing the swapped-out
> pages back to memory, khugepaged checks them and allows up to a
> certain number. It also prints out the amount of unmapped ptes
> via the tracepoints.

This may need some more refinement in the future, but your patch series seems to create a large improvement over what we have now.

> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
* Re: [RFC v5 2/3] mm: make optimistic check for swapin readahead
From: Andrew Morton
Date: 2015-09-14 21:33 UTC
To: Ebru Akagunduz
Cc: linux-mm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On Mon, 14 Sep 2015 22:31:44 +0300 Ebru Akagunduz <ebru.akagunduz@gmail.com> wrote:

> This patch introduces a new sysfs integer knob
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
> which makes an optimistic check for swapin readahead to
> increase the THP collapse rate. Before bringing the swapped-out
> pages back to memory, khugepaged checks them and allows up to a
> certain number. It also prints out the amount of unmapped ptes
> via the tracepoints.

Can we please get this control documented? Documentation/vm/transhuge.txt appears to be the place for it.
* Re: [RFC v5 2/3] mm: make optimistic check for swapin readahead
From: Ebru Akagunduz
Date: 2015-09-15 20:08 UTC
To: Andrew Morton
Cc: linux-mm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On Mon, Sep 14, 2015 at 02:33:55PM -0700, Andrew Morton wrote:
> On Mon, 14 Sep 2015 22:31:44 +0300 Ebru Akagunduz <ebru.akagunduz@gmail.com> wrote:
>
> > This patch introduces a new sysfs integer knob
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
> > which makes an optimistic check for swapin readahead to
> > increase the THP collapse rate. Before bringing the swapped-out
> > pages back to memory, khugepaged checks them and allows up to a
> > certain number. It also prints out the amount of unmapped ptes
> > via the tracepoints.
>
> Can we please get this control documented?
> Documentation/vm/transhuge.txt appears to be the place for it.

I will add documentation about max_ptes_swap to Documentation/vm/transhuge.txt and send it with a new patch.

Kind regards,
Ebru
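For illustration, the documentation Andrew asks for might read roughly as follows in Documentation/vm/transhuge.txt. The wording here is only a sketch, not the text that was eventually submitted:

max_ptes_swap specifies how many pages of a huge-page-sized range may
currently be swapped out for khugepaged to still consider collapsing
the range; the missing pages are swapped back in first. A higher value
recovers THPs sooner after swapout, at the cost of extra swap I/O; a
lower value avoids speculative swapin while memory is still scarce.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap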
* [RFC v5 3/3] mm: make swapin readahead to improve thp collapse rate
From: Ebru Akagunduz
Date: 2015-09-14 19:31 UTC
To: linux-mm
Cc: akpm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel, Ebru Akagunduz

This patch makes swapin readahead improve the THP collapse rate.
When khugepaged scans pages, a few of the pages can be in the
swap area.

With the patch, khugepaged can collapse 4kB pages into a THP when
there are up to max_ptes_swap swap ptes in a 2MB range.

The patch was tested with a test program that allocates 400MB of
memory, writes to it, and then sleeps. I force the system to swap
it all out. Afterwards, the test program touches the area by
writing to it, skipping one page in each 20 pages of the area.

Without the patch, the system did no swapin readahead. The THP
rate was 65% of the program's memory and did not change over time.

With this patch, after 10 minutes of waiting, khugepaged had
collapsed 99% of the program's memory.

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
Changes in v2:
 - Use the FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT flags instead of 0x0 when calling do_swap_page from __collapse_huge_page_swapin (Rik van Riel)

Changes in v3:
 - Catch the VM_FAULT_HWPOISON and VM_FAULT_OOM return cases in __collapse_huge_page_swapin (Kirill A. Shutemov)

Changes in v4:
 - Fix broken indentation, reverting the if (...) statement in __collapse_huge_page_swapin (Kirill A. Shutemov)
 - Fix the check statement of ret (Kirill A. Shutemov)
 - Use the swapped_in name instead of swap_pte

Changes in v5:
 - Export do_swap_page in mm/internal.h instead of outside of mm (Vlastimil Babka)

Test results:

                 After swapped out
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction |
-------------------------------------------------------------------
With patch    | 90076 kB  | 88064 kB      | 309928 kB | 99%      |
-------------------------------------------------------------------
Without patch | 194068 kB | 192512 kB     | 205936 kB | 99%      |
-------------------------------------------------------------------

                 After swapped in
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction |
-------------------------------------------------------------------
With patch    | 201408 kB | 198656 kB     | 198596 kB | 98%      |
-------------------------------------------------------------------
Without patch | 292624 kB | 192512 kB     | 107380 kB | 65%      |
-------------------------------------------------------------------

 include/trace/events/huge_memory.h | 24 +++++++++++++++++++++
 mm/huge_memory.c                   | 43 ++++++++++++++++++++++++++++++++++++++
 mm/internal.h                      |  4 ++++
 mm/memory.c                        |  2 +-
 4 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 153274c..1efc7f1 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -136,6 +136,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__print_symbolic(__entry->status, SCAN_STATUS))
 );
 
+TRACE_EVENT(mm_collapse_huge_page_swapin,
+
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int ret),
+
+	TP_ARGS(mm, swapped_in, ret),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(int, swapped_in)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->swapped_in = swapped_in;
+		__entry->ret = ret;
+	),
+
+	TP_printk("mm=%p, swapped_in=%d, ret=%d",
+		__entry->mm,
+		__entry->swapped_in,
+		__entry->ret)
+);
+
 #endif /* __HUGE_MEMORY_H */
 #include <trace/define_trace.h>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 049b0db..e83f20a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2584,6 +2584,47 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
 	return true;
 }
 
+/*
+ * Bring missing pages in from swap, to complete THP collapse.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ *
+ * Called and returns without pte mapped or spinlocks held,
+ * but with mmap_sem held to protect against vma changes.
+ */ + +static void __collapse_huge_page_swapin(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + pte_t *pte) +{ + unsigned long _address; + pte_t pteval = *pte; + int swapped_in = 0, ret = 0; + + pte = pte_offset_map(pmd, address); + for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE; + pte++, _address += PAGE_SIZE) { + pteval = *pte; + if (!is_swap_pte(pteval)) + continue; + swapped_in++; + ret = do_swap_page(mm, vma, _address, pte, pmd, + FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, + pteval); + if (ret & VM_FAULT_ERROR) { + trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0); + return; + } + /* pte is unmapped now, we need to map it */ + pte = pte_offset_map(pmd, _address); + } + pte--; + pte_unmap(pte); + trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1); +} + + + static void collapse_huge_page(struct mm_struct *mm, unsigned long address, struct page **hpage, @@ -2655,6 +2696,8 @@ static void collapse_huge_page(struct mm_struct *mm, anon_vma_lock_write(vma->anon_vma); + __collapse_huge_page_swapin(mm, vma, address, pmd, pte); + pte = pte_offset_map(pmd, address); pte_ptl = pte_lockptr(mm, pmd); diff --git a/mm/internal.h b/mm/internal.h index bc0fa9a..867ea14 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -14,6 +14,10 @@ #include <linux/fs.h> #include <linux/mm.h> +extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *page_table, pmd_t *pmd, + unsigned int flags, pte_t orig_pte); + void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); diff --git a/mm/memory.c b/mm/memory.c index 9cb2747..caecc64 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2441,7 +2441,7 @@ EXPORT_SYMBOL(unmap_mapping_range); * We return with the mmap_sem locked or unlocked in the same cases * as does filemap_fault(). */ -static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, +int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, unsigned int flags, pte_t orig_pte) { -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 17+ messages in thread
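One behavioural detail worth spelling out, since it comes up later in this thread: because __collapse_huge_page_swapin() passes FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, do_swap_page() only initiates the swap reads and returns without waiting for the I/O to complete. The first scan of an extent therefore typically just starts the swapins, and the collapse itself tends to succeed on a later pass, once the pages have landed in the swap cache.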
* Re: [RFC v5 3/3] mm: make swapin readahead to improve thp collapse rate
From: Kirill A. Shutemov
Date: 2015-09-17 13:28 UTC
To: Ebru Akagunduz
Cc: linux-mm, akpm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On Mon, Sep 14, 2015 at 10:31:45PM +0300, Ebru Akagunduz wrote:
> @@ -2655,6 +2696,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>
>  	anon_vma_lock_write(vma->anon_vma);
>
> +	__collapse_huge_page_swapin(mm, vma, address, pmd, pte);
> +

Am I missing something, or is 'pte' not initialized at this point? And the value is not really used in __collapse_huge_page_swapin().

>  	pte = pte_offset_map(pmd, address);
>  	pte_ptl = pte_lockptr(mm, pmd);
* Re: [RFC v5 3/3] mm: make swapin readahead to improve thp collapse rate
From: Kirill A. Shutemov
Date: 2015-09-17 15:13 UTC
To: Ebru Akagunduz
Cc: linux-mm, akpm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On Mon, Sep 14, 2015 at 10:31:45PM +0300, Ebru Akagunduz wrote:
> This patch makes swapin readahead improve the THP collapse rate.
> When khugepaged scans pages, a few of the pages can be in the
> swap area.
>
> With the patch, khugepaged can collapse 4kB pages into a THP when
> there are up to max_ptes_swap swap ptes in a 2MB range.
>
> The patch was tested with a test program that allocates 400MB of
> memory, writes to it, and then sleeps. I force the system to swap
> it all out. Afterwards, the test program touches the area by
> writing to it, skipping one page in each 20 pages of the area.
>
> Without the patch, the system did no swapin readahead. The THP
> rate was 65% of the program's memory and did not change over time.
>
> With this patch, after 10 minutes of waiting, khugepaged had
> collapsed 99% of the program's memory.
>
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Acked-by: Rik van Riel <riel@redhat.com>

[ resend with correct TO/CC lists ]
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Andrew Morton
Date: 2015-09-14 21:41 UTC
To: Ebru Akagunduz
Cc: linux-mm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On Mon, 14 Sep 2015 22:31:42 +0300 Ebru Akagunduz <ebru.akagunduz@gmail.com> wrote:

> This patch series makes swapin readahead up to a certain number
> to gain more THP performance, and adds tracepoints for
> khugepaged_scan_pmd, collapse_huge_page and
> __collapse_huge_page_isolate.

I'll merge this series for testing. Hopefully Andrea and/or Hugh will find time for a quality think about the issue before 4.3 comes around.

It would be much better if we didn't have that sysfs knob - make the control automatic in some fashion.

If we can't think of a way of doing that then at least let's document max_ptes_swap very carefully. Explain to our users what it does, why they should care about it, how they should set about determining (ie: measuring) its effect upon their workloads.
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Hugh Dickins
Date: 2016-02-25 7:36 UTC
To: Ebru Akagunduz
Cc: Andrew Morton, linux-mm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hughd, hannes, mhocko, boaz, raindel

On Mon, 14 Sep 2015, Andrew Morton wrote:
> On Mon, 14 Sep 2015 22:31:42 +0300 Ebru Akagunduz <ebru.akagunduz@gmail.com> wrote:
>
> > This patch series makes swapin readahead up to a certain number
> > to gain more THP performance, and adds tracepoints for
> > khugepaged_scan_pmd, collapse_huge_page and
> > __collapse_huge_page_isolate.
>
> I'll merge this series for testing. Hopefully Andrea and/or Hugh will
> find time for a quality think about the issue before 4.3 comes around.
>
> It would be much better if we didn't have that sysfs knob - make the
> control automatic in some fashion.
>
> If we can't think of a way of doing that then at least let's document
> max_ptes_swap very carefully. Explain to our users what it does, why
> they should care about it, how they should set about determining (ie:
> measuring) its effect upon their workloads.

Ebru, I don't know whether you realize, but your THP swapin work has been languishing in mmotm for five months now, without getting any nearer to Linus's tree.

That's partly my fault - sorry - for not responding to Andrew's nudge above. But I think you also got caught up in conference, and in the end did not get around to answering outstanding issues: please take a look at your mailbox from last September, to see what more is needed.

Here's what mmotm's series file says...

#mm-add-tracepoint-for-scanning-pages.patch+2: Andrea/Hugh review?. 2 Fengguang warnings, one "kernel test robot" oops
#mm-make-optimistic-check-for-swapin-readahead.patch: TBU (docs)
mm-make-optimistic-check-for-swapin-readahead.patch
mm-make-optimistic-check-for-swapin-readahead-fix-2.patch
#mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch: Hugh/Kirill want collapse_huge_page() rework
mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-2.patch
#mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch: Ebru to test?
mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch

...but I think some of that is stale. There were a few little bugs when it first went into mmotm, which Kirill very swiftly fixed up, and I don't think it has given anybody any trouble since then.

But do I want to see this work go in? Yes and no. The problem it fixes (that although we give out a THP to someone who faults a single page of it, after swapout the THP cannot be recovered until they have faulted in every page of it) is real and embarrassing; the code is good; and I don't mind the max_ptes_swap tunable that concerns Andrew above; but Kirill and Vlastimil made important points that still trouble me.

I can't locate Kirill's mail right now, perhaps I'm misremembering: but wasn't he concerned by your __collapse_huge_page_swapin() (likely to be allocating many small pages) being called under down_write of mmap_sem?
That's usually something we soon regret, and even down_read of mmap_sem across many memory allocations would be unfortunate (khugepaged used to allocate its THP that way, but we have Vlastimil to thank for stopping that in his 8b1645685acf).

And didn't Vlastimil (9/4/15) make some other unanswered observations about the call to __collapse_huge_page_swapin():

> Hmm it seems rather wasteful to call this when no swap entries were detected.
> Also it seems pointless to try continue collapsing when we have just only issued
> async swap-in? What are the chances they would finish in time?
>
> I'm less sure about the relation vs khugepaged_alloc_page(). At this point, we
> have already succeeded the hugepage allocation. It makes sense not to swap-in if
> we can't allocate a hugepage. It makes also sense not to allocate a hugepage if
> we will just issue async swap-ins and then free the hugepage back. Swap-in means
> disk I/O that's best avoided if not useful. But the reclaim for hugepage
> allocation might also involve disk I/O. At worst, it could be creating new swap
> pte's in the very pmd we are scanning... Thoughts?

Doesn't this imply that __collapse_huge_page_swapin() will initiate all the necessary swapins for a THP, then (given the FAULT_FLAG_ALLOW_RETRY) not wait for them to complete, so khugepaged will give up on that extent and move on to another; then after another full circuit of all the mms it needs to examine, it will arrive back at this extent and build a THP from the swapins it arranged last time.

Which may work well when a system transitions from busy+swappingout to idle+swappingin, but isn't that rather a special case? It feels (meaning, I've not measured at all) as if the in-between busyish case will waste a lot of I/O and memory on swapins that have to be discarded again before khugepaged has made its sedate way back to slotting them in.

So I wonder how useful this is in its present form. The problem being, not with your code as such, but the whole nature of khugepaged. When I had to solve a similar problem with recovering huge tmpfs pages (not yet posted), I did briefly consider whether to hook in to use khugepaged; but rejected that, and have never regretted using a workqueue item for the extent instead. Did Vlastimil (argh, him again!) propose something similar to replace khugepaged? Or should khugepaged fire off workqueue items for THP extents needing swapin?

Hugh
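For concreteness, the workqueue alternative Hugh alludes to might look roughly like the sketch below. This is purely illustrative: collapse_swapin_work and both function names are invented here and do not come from any posted patch.

#include <linux/workqueue.h>
#include <linux/slab.h>

struct collapse_swapin_work {
	struct work_struct work;
	struct mm_struct *mm;
	unsigned long address;	/* start of the huge-page-sized extent */
};

static void collapse_swapin_fn(struct work_struct *work)
{
	struct collapse_swapin_work *csw =
		container_of(work, struct collapse_swapin_work, work);

	/*
	 * Swap the extent's missing pages in synchronously here, then
	 * retry the collapse, instead of waiting for khugepaged's next
	 * full circuit of the mm list. A reference on csw->mm would
	 * need to be held and dropped here; that is omitted in this sketch.
	 */
	kfree(csw);
}

/* khugepaged would queue one item per extent it finds worth swapping in */
static void queue_collapse_swapin(struct mm_struct *mm, unsigned long address)
{
	struct collapse_swapin_work *csw = kmalloc(sizeof(*csw), GFP_KERNEL);

	if (!csw)
		return;
	csw->mm = mm;
	csw->address = address;
	INIT_WORK(&csw->work, collapse_swapin_fn);
	schedule_work(&csw->work);
}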
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Rik van Riel
Date: 2016-02-25 22:35 UTC
To: Hugh Dickins, Ebru Akagunduz
Cc: Andrew Morton, linux-mm, kirill.shutemov, n-horiguchi, aarcange, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hannes, mhocko, boaz, raindel

On Wed, 2016-02-24 at 23:36 -0800, Hugh Dickins wrote:
>
> Doesn't this imply that __collapse_huge_page_swapin() will initiate
> all the necessary swapins for a THP, then (given the
> FAULT_FLAG_ALLOW_RETRY) not wait for them to complete, so khugepaged
> will give up on that extent and move on to another; then after
> another full circuit of all the mms it needs to examine, it will
> arrive back at this extent and build a THP from the swapins it
> arranged last time.
>
> Which may work well when a system transitions from busy+swappingout
> to idle+swappingin, but isn't that rather a special case? It feels
> (meaning, I've not measured at all) as if the in-between busyish case
> will waste a lot of I/O and memory on swapins that have to be
> discarded again before khugepaged has made its sedate way back to
> slotting them in.

There may be a fairly simple way to prevent that from becoming an issue.

When khugepaged wakes up, it can check the PSWPOUT or even the PGSTEAL_* stats for the system, and skip swapin readahead if there was swapout activity (or any page reclaim activity?) since the time it last ran.

That way the swapin readahead will do its thing when transitioning from busy + swapout to idle + swapin, but not while the system is under permanent memory pressure.

Am I forgetting anything obvious?

Is this too aggressive? Not aggressive enough?

Could PGPGOUT + PSWPOUT be a useful in-between between just PSWPOUT or PGSTEAL_*?

-- 
All rights reversed
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Ebru Akagunduz
Date: 2016-02-25 23:30 UTC
To: linux-mm, riel, hughd
Cc: akpm, kirill.shutemov, n-horiguchi, aarcange, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hannes, mhocko, boaz, raindel

On Thu, Feb 25, 2016 at 05:35:50PM -0500, Rik van Riel wrote:
> On Wed, 2016-02-24 at 23:36 -0800, Hugh Dickins wrote:
> >
> > Doesn't this imply that __collapse_huge_page_swapin() will initiate
> > all the necessary swapins for a THP, then (given the
> > FAULT_FLAG_ALLOW_RETRY) not wait for them to complete, so khugepaged
> > will give up on that extent and move on to another; then after
> > another full circuit of all the mms it needs to examine, it will
> > arrive back at this extent and build a THP from the swapins it
> > arranged last time.
> >
> > Which may work well when a system transitions from busy+swappingout
> > to idle+swappingin, but isn't that rather a special case? It feels
> > (meaning, I've not measured at all) as if the in-between busyish case
> > will waste a lot of I/O and memory on swapins that have to be
> > discarded again before khugepaged has made its sedate way back to
> > slotting them in.
>
> There may be a fairly simple way to prevent that from becoming an issue.
>
> When khugepaged wakes up, it can check the PSWPOUT or even the
> PGSTEAL_* stats for the system, and skip swapin readahead if there
> was swapout activity (or any page reclaim activity?) since the time
> it last ran.
>
> That way the swapin readahead will do its thing when transitioning
> from busy + swapout to idle + swapin, but not while the system is
> under permanent memory pressure.

The idea makes sense to me.

> Am I forgetting anything obvious?
>
> Is this too aggressive? Not aggressive enough?
>
> Could PGPGOUT + PSWPOUT be a useful in-between between just PSWPOUT
> or PGSTEAL_*?
>
> -- 
> All rights reversed
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Hugh Dickins
Date: 2016-02-26 6:17 UTC
To: Ebru Akagunduz
Cc: linux-mm, riel, hughd, akpm, kirill.shutemov, n-horiguchi, aarcange, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hannes, mhocko, boaz, raindel

On Fri, 26 Feb 2016, Ebru Akagunduz wrote:
> On Thu, Feb 25, 2016 at 05:35:50PM -0500, Rik van Riel wrote:
> > On Wed, 2016-02-24 at 23:36 -0800, Hugh Dickins wrote:
> > >
> > > Doesn't this imply that __collapse_huge_page_swapin() will initiate
> > > all the necessary swapins for a THP, then (given the
> > > FAULT_FLAG_ALLOW_RETRY) not wait for them to complete, so khugepaged
> > > will give up on that extent and move on to another; then after
> > > another full circuit of all the mms it needs to examine, it will
> > > arrive back at this extent and build a THP from the swapins it
> > > arranged last time.
> > >
> > > Which may work well when a system transitions from busy+swappingout
> > > to idle+swappingin, but isn't that rather a special case? It feels
> > > (meaning, I've not measured at all) as if the in-between busyish case
> > > will waste a lot of I/O and memory on swapins that have to be
> > > discarded again before khugepaged has made its sedate way back to
> > > slotting them in.
> >
> > There may be a fairly simple way to prevent that from becoming an issue.
> >
> > When khugepaged wakes up, it can check the PSWPOUT or even the
> > PGSTEAL_* stats for the system, and skip swapin readahead if there
> > was swapout activity (or any page reclaim activity?) since the time
> > it last ran.
> >
> > That way the swapin readahead will do its thing when transitioning
> > from busy + swapout to idle + swapin, but not while the system is
> > under permanent memory pressure.
>
> The idea makes sense to me.

Yes, it does sound a promising approach: please give it a try.

> > Am I forgetting anything obvious?
> >
> > Is this too aggressive? Not aggressive enough?
> >
> > Could PGPGOUT + PSWPOUT be a useful in-between between just PSWPOUT
> > or PGSTEAL_*?

I've no idea offhand, would have to study what each of those actually means: I'm really not familiar with them myself.

I did wonder whether to suggest using swapin_readahead_hits instead, but there's probably several reasons why that would be a bad idea (its volatility, its intent for a different and private purpose, and perhaps an inappropriate feedback effect - the swap pages of a split THP are much more likely to be adjacent than usually happens, so readahead probably pays off well for them, which is good, but should not feed back into the decision).

There is also a question of where to position the test or tests: allocating the THP, and allocating pages for swapin, will apply their own pressure, in danger of generating swapout.

Hugh
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Rik van Riel
Date: 2016-02-26 14:51 UTC
To: Hugh Dickins, Ebru Akagunduz
Cc: linux-mm, akpm, kirill.shutemov, n-horiguchi, aarcange, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hannes, mhocko, boaz, raindel

On Thu, 2016-02-25 at 22:17 -0800, Hugh Dickins wrote:
> On Fri, 26 Feb 2016, Ebru Akagunduz wrote:
> > On Thu, Feb 25, 2016 at 05:35:50PM -0500, Rik van Riel wrote:
> > >
> > > Am I forgetting anything obvious?
> > >
> > > Is this too aggressive? Not aggressive enough?
> > >
> > > Could PGPGOUT + PSWPOUT be a useful in-between between just
> > > PSWPOUT or PGSTEAL_*?
>
> I've no idea offhand, would have to study what each of those
> actually means: I'm really not familiar with them myself.

There are a few levels of page reclaim activity:

PGSTEAL_* - any page was reclaimed; this could just be file pages for streaming file IO, etc.

PGPGOUT - the VM wrote pages back to disk to reclaim them; this could include file pages

PSWPOUT - the VM wrote something to swap to reclaim memory

I am not sure which level of aggressiveness khugepaged should check against, but my gut instinct would probably be the second or third.

-- 
All Rights Reversed.
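A rough sketch of the kind of check being discussed, assuming khugepaged samples the global event counters through all_vm_events() (available with CONFIG_VM_EVENT_COUNTERS); the function and variable names here are invented for illustration and do not come from any posted patch:

#include <linux/vmstat.h>

static unsigned long khugepaged_last_pswpout;

/* true if any swapout has happened since khugepaged last looked */
static bool swapout_since_last_scan(void)
{
	unsigned long events[NR_VM_EVENT_ITEMS];
	bool active;

	all_vm_events(events);
	active = events[PSWPOUT] != khugepaged_last_pswpout;
	khugepaged_last_pswpout = events[PSWPOUT];
	return active;
}

khugepaged would then skip __collapse_huge_page_swapin() whenever this returns true; substituting PGPGOUT or the PGSTEAL_* events for PSWPOUT would move the check between the aggressiveness levels Rik lists above.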
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
From: Ebru Akagunduz
Date: 2016-03-03 22:08 UTC
To: hughd, riel
Cc: linux-mm, akpm, kirill.shutemov, n-horiguchi, aarcange, iamjoonsoo.kim, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hannes, mhocko, boaz, raindel

On Fri, Feb 26, 2016 at 09:51:56AM -0500, Rik van Riel wrote:
> On Thu, 2016-02-25 at 22:17 -0800, Hugh Dickins wrote:
> > On Fri, 26 Feb 2016, Ebru Akagunduz wrote:
> > > On Thu, Feb 25, 2016 at 05:35:50PM -0500, Rik van Riel wrote:
> > > >
> > > > Am I forgetting anything obvious?
> > > >
> > > > Is this too aggressive? Not aggressive enough?
> > > >
> > > > Could PGPGOUT + PSWPOUT be a useful in-between between just
> > > > PSWPOUT or PGSTEAL_*?
> >
> > I've no idea offhand, would have to study what each of those
> > actually means: I'm really not familiar with them myself.
>
> There are a few levels of page reclaim activity:
>
> PGSTEAL_* - any page was reclaimed; this could just be file pages
> for streaming file IO, etc.
>
> PGPGOUT - the VM wrote pages back to disk to reclaim them; this
> could include file pages
>
> PSWPOUT - the VM wrote something to swap to reclaim memory
>
> I am not sure which level of aggressiveness khugepaged should check
> against, but my gut instinct would probably be the second or third.

I tested with PGPGOUT; it does not help as I expected. Following Rik's suggestion, PSWPOUT and ALLOCSTALL look good. I started to prepare the patch last week. Just wanted to keep you posted.

Kind regards.
* Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance
  2016-02-25  7:36 ` Hugh Dickins
  2016-02-25 22:35 ` Rik van Riel
@ 2016-02-25 23:16 ` Ebru Akagunduz
  1 sibling, 0 replies; 17+ messages in thread
From: Ebru Akagunduz @ 2016-02-25 23:16 UTC (permalink / raw)
To: Hugh Dickins
Cc: akpm, linux-mm, kirill.shutemov, n-horiguchi, aarcange, riel, iamjoonsoo.kim, xiexiuqi, gorcunov, linux-kernel, mgorman, rientjes, vbabka, aneesh.kumar, hannes, mhocko, boaz, raindel

On Wed, Feb 24, 2016 at 11:36:30PM -0800, Hugh Dickins wrote:
> On Mon, 14 Sep 2015, Andrew Morton wrote:
> > On Mon, 14 Sep 2015 22:31:42 +0300 Ebru Akagunduz <ebru.akagunduz@gmail.com> wrote:
> > > This patch series makes swapin readahead up to a certain number to gain more thp performance and adds tracepoint for khugepaged_scan_pmd, collapse_huge_page, __collapse_huge_page_isolate.
> >
> > I'll merge this series for testing. Hopefully Andrea and/or Hugh will find time for a quality think about the issue before 4.3 comes around.
> >
> > It would be much better if we didn't have that sysfs knob - make the control automatic in some fashion.
> >
> > If we can't think of a way of doing that then at least let's document max_ptes_swap very carefully. Explain to our users what it does, why they should care about it, how they should set about determining (ie: measuring) its effect upon their workloads.
>
> Ebru, I don't know whether you realize, but your THP swapin work has been languishing in mmotm for five months now, without getting any nearer to Linus's tree.
>
> That's partly my fault - sorry - for not responding to Andrew's nudge above. But I think you also got caught up in conference, and in the end did not get around to answering outstanding issues: please take a look at your mailbox from last September, to see what more is needed.

I've seen my patch series in the mmotm mails, but I thought other parts of THP had problems and that the series would be forwarded to Linus's tree once those were fixed. I did not know about this file: http://www.ozlabs.org/~akpm/mmotm/series which explicitly shows the status of each patch. Thank you for summarizing it below.

> Here's what mmotm's series file says...
>
> #mm-add-tracepoint-for-scanning-pages.patch+2: Andrea/Hugh review?. 2 Fengguang warnings, one "kernel test robot" oops
> #mm-make-optimistic-check-for-swapin-readahead.patch: TBU (docs)

I've sent a doc patch: http://lkml.iu.edu/hypermail/linux/kernel/1509.2/01783.html

> mm-make-optimistic-check-for-swapin-readahead.patch
> mm-make-optimistic-check-for-swapin-readahead-fix-2.patch
> #mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch: Hugh/Kirill want collapse_huge_page() rework
> mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch
> mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix.patch
> mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-2.patch
> #mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch: Ebru to test?

I've tested my whole patch series and could not reproduce the fault. I've also seen a Tested-by tag from Sergey, so I did not send the tag.

> mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch
>
> ...but I think some of that is stale. There were a few little bugs when it first went into mmotm, which Kirill very swiftly fixed up, and I don't think it has given anybody any trouble since then.
>
> But do I want to see this work go in? Yes and no.
> The problem it fixes (that although we give out a THP to someone who faults a single page of it, after swapout the THP cannot be recovered until they have faulted in every page of it) is real and embarrassing; the code is good; and I don't mind the max_ptes_swap tunable that concerns Andrew above; but Kirill and Vlastimil made important points that still trouble me.
>
> I can't locate Kirill's mail right now, perhaps I'm misremembering: but wasn't he concerned by your __collapse_huge_page_swapin() (likely to be allocating many small pages) being called under down_write of mmap_sem? That's usually something we soon regret, and even down_read of mmap_sem across many memory allocations would be unfortunate (khugepaged used to allocate its THP that way, but we have Vlastimil to thank for stopping that in his 8b1645685acf).
>
> And didn't Vlastimil (9/4/15) make some other unanswered observations about the call to __collapse_huge_page_swapin():
>
> > Hmm it seems rather wasteful to call this when no swap entries were detected. Also it seems pointless to try to continue collapsing when we have just only issued async swap-in? What are the chances they would finish in time?
> >
> > I'm less sure about the relation vs khugepaged_alloc_page(). At this point, we have already succeeded the hugepage allocation. It makes sense not to swap in if we can't allocate a hugepage. It also makes sense not to allocate a hugepage if we will just issue async swap-ins and then free the hugepage back. Swap-in means disk I/O that's best avoided if not useful. But the reclaim for hugepage allocation might also involve disk I/O. At worst, it could be creating new swap pte's in the very pmd we are scanning... Thoughts?

You're right, I did not take enough responsibility. I should at least have asked about the patch.

> Doesn't this imply that __collapse_huge_page_swapin() will initiate all the necessary swapins for a THP, then (given the FAULT_FLAG_ALLOW_RETRY) not wait for them to complete, so khugepaged will give up on that extent and move on to another; then after another full circuit of all the mms it needs to examine, it will arrive back at this extent and build a THP from the swapins it arranged last time.
>
> Which may work well when a system transitions from busy+swappingout to idle+swappingin, but isn't that rather a special case? It feels (meaning, I've not measured at all) as if the inbetween busyish case will waste a lot of I/O and memory on swapins that have to be discarded again before khugepaged has made its sedate way back to slotting them in.
>
> So I wonder how useful this is in its present form. The problem being, not with your code as such, but the whole nature of khugepaged. When I had to solve a similar problem with recovering huge tmpfs pages (not yet posted), I did briefly consider whether to hook in to use khugepaged; but rejected that, and have never regretted using a workqueue item for the extent instead. Did Vlastimil (argh, him again!) propose something similar to replace khugepaged? Or should khugepaged fire off workqueue items for THP extents needing swapin?
>
> Hugh
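[Editor's note: for reference, a simplified sketch of the behaviour Hugh describes, not the exact code from the series. It is written as if in mm/huge_memory.c with pre-4.8 APIs, and assumes do_swap_page() has been made visible to khugepaged as the series arranges. Each swapped-out pte is faulted with FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT, so the swap I/O is only initiated here; the pages become resident later, on a subsequent khugepaged pass, which is exactly the round trip Hugh questions above.]

static void __collapse_huge_page_swapin(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long address, pmd_t *pmd)
{
	unsigned long addr;
	pte_t *pte, pteval;

	for (addr = address; addr < address + HPAGE_PMD_NR * PAGE_SIZE;
	     addr += PAGE_SIZE) {
		pte = pte_offset_map(pmd, addr);
		pteval = *pte;
		if (!is_swap_pte(pteval)) {
			pte_unmap(pte);
			continue;
		}
		/*
		 * Kicks off the swapin and returns without waiting for
		 * the I/O; do_swap_page() drops the pte mapping itself.
		 */
		if (do_swap_page(mm, vma, addr, pte, pmd,
				 FAULT_FLAG_ALLOW_RETRY |
				 FAULT_FLAG_RETRY_NOWAIT,
				 pteval) & VM_FAULT_ERROR)
			break;
	}
}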
end of thread, other threads: [~2016-03-03 22:08 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
2015-09-14 19:31 [RFC v5 0/3] mm: make swapin readahead to gain more thp performance Ebru Akagunduz
2015-09-14 19:31 ` [RFC v5 1/3] mm: add tracepoint for scanning pages Ebru Akagunduz
2015-09-14 19:31 ` [RFC v5 2/3] mm: make optimistic check for swapin readahead Ebru Akagunduz
2015-09-14 19:47 ` Rik van Riel
2015-09-14 21:33 ` Andrew Morton
2015-09-15 20:08 ` Ebru Akagunduz
2015-09-14 19:31 ` [RFC v5 3/3] mm: make swapin readahead to improve thp collapse rate Ebru Akagunduz
2015-09-17 13:28 ` Kirill A. Shutemov
2015-09-17 15:13 ` Kirill A. Shutemov
2015-09-14 21:41 ` [RFC v5 0/3] mm: make swapin readahead to gain more thp performance Andrew Morton
2016-02-25  7:36 ` Hugh Dickins
2016-02-25 22:35 ` Rik van Riel
2016-02-25 23:30 ` Ebru Akagunduz
2016-02-26  6:17 ` Hugh Dickins
2016-02-26 14:51 ` Rik van Riel
2016-03-03 22:08 ` Ebru Akagunduz
2016-02-25 23:16 ` Ebru Akagunduz