* [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
@ 2016-09-07 16:45 Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
                   ` (12 more replies)
  0 siblings, 13 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
From: Huang Ying <ying.huang@intel.com>
This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.
Hi, Andrew, could you help me to check whether the overall design is
reasonable?
Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [01/10], [04/10], [05/10],
[06/10], [07/10], [10/10].
Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
Hi, Johannes, Michal and Vladimir, I am not very confident about the
memory cgroup part, especially [02/10] and [03/10].  Could you help me
to review it?
And for all, any comments are welcome!
Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth when swapping pages out, even on
a high-end server machine, because storage device performance has
improved faster than CPU performance.  And it seems that the trend
will not change in the near future.  On the other hand, THP is becoming
more and more popular because of increased memory sizes.  So it
becomes necessary to optimize THP swap performance.
The advantages of the THP swap support include:
- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of the THP swap.
- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which is usually 4k random
  IO.  This will improve the performance of the THP swap too.
- It will help reduce memory fragmentation, especially when THP is
  heavily used by the applications.  2M of contiguous pages will be
  freed up after a THP is swapped out.
This patchset is based on 8/31 head of mmotm/master.
This patchset is the first step for the THP swap support.  The plan is
to delay splitting the THP step by step, and finally to avoid splitting
the THP during swap out altogether and swap the THP out/in as a whole.
As the first step, in this patchset, splitting the huge page is
delayed from almost the first step of swapping out to after allocating
the swap space for the THP and adding the THP into the swap cache.
This will reduce lock acquiring/releasing for the locks used for the
swap cache management.
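To illustrate why delaying the split reduces lock traffic, here is a toy
userspace model.  It is only a sketch: the "lock" is a counter, and all
names (toy_swap_cache, toy_add_split, toy_add_whole) are hypothetical and
do not appear in the patchset.  The real locking in the kernel may differ
in detail.

```c
#include <assert.h>
#include <stddef.h>

#define THP_NR_PAGES 512	/* 2M THP / 4k base pages on x86_64 */

struct toy_swap_cache {
	unsigned long lock_acquisitions;	/* stands in for the real lock */
	unsigned long nr_entries;
};

static void toy_lock(struct toy_swap_cache *c)
{
	c->lock_acquisitions++;
}

static void toy_unlock(struct toy_swap_cache *c)
{
	(void)c;	/* nothing to do in this model */
}

/* Split first: every base page pays for its own lock round trip. */
static void toy_add_split(struct toy_swap_cache *c)
{
	size_t i;

	for (i = 0; i < THP_NR_PAGES; i++) {
		toy_lock(c);
		c->nr_entries++;
		toy_unlock(c);
	}
}

/* Split delayed: all sub-pages are added under one lock round trip. */
static void toy_add_whole(struct toy_swap_cache *c)
{
	size_t i;

	assert(THP_NR_PAGES > 0);
	toy_lock(c);
	for (i = 0; i < THP_NR_PAGES; i++)
		c->nr_entries++;
	toy_unlock(c);
}
```

In this model both paths add the same number of entries, but the first
takes the lock 512 times and the second only once.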
With the patchset, the swap out throughput improves by 12.1% (from
about 1.12GB/s to about 1.25GB/s) in the vm-scalability swap-w-seq test
case with 16 processes.  The test is done on a Xeon E5 v3 system.  The
swap device used is a RAM simulated PMEM (persistent memory) device.
To test the sequential swapping out, the test case uses 16 processes,
which sequentially allocate and write to anonymous pages until the
RAM and part of the swap device are used up.
The detailed comparison result is as follows:
base             base+patchset
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
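The %change column above can be rechecked from the two means in each row.
The helper below is just a sketch for that arithmetic; percent_change() is
a hypothetical name and not part of any test tool.

```c
#include <assert.h>

/* Relative change from the base mean to the patched mean, in percent. */
static double percent_change(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

For example, percent_change(1118821, 1254241) gives roughly 12.1 for the
vmstat.swap.so row, and percent_change(308.79, 284.53) gives roughly -7.9
for the elapsed time row.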
From the swap out throughput number, we can see that, even when tested
on a RAM simulated PMEM (Persistent Memory) device, the swap out
throughput can reach only about 1.1GB/s, while in the file IO test the
sequential write throughput of an Intel P3700 SSD can reach about
1.8GB/s steadily.  And according to the following URL,
https://www-ssl.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html
the sequential write throughput of the Intel P3608 SSD can reach about
3.0GB/s, while the random read IOPS can reach about 850k.  It is clear
that the bottleneck has moved from the disk to the kernel swap
component itself.
The improved storage device performance should have made swap a
better feature than before, with better performance.  But because of
the issues in the kernel swap component itself, swap performance
remains low.  That prevents the swap feature from being used by more
users.  And this in turn causes few kernel developers to think it is
necessary to optimize the kernel swap component.  To break the loop, we
need to optimize the performance of the kernel swap component.
Optimizing the THP swap performance is part of it.
Changelog:
v3:
- Per Andrew's suggestion, used a more systematic way to determine
  whether to enable THP swap optimization
- Per Andrew's comments, moved as much code as possible into
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE/#endif or "if (PageTransHuge())"
- Fixed some coding style warnings.
v2:
- Original [1/11] sent separately and merged
- Use switch in 10/10 per Hiff's suggestion
Best Regards,
Huang, Ying
* [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-08  5:45   ` Anshuman Khandual
                     ` (3 more replies)
  2016-09-07 16:46 ` [PATCH -v3 02/10] mm, memcg: Add swap_cgroup_iter iterator Huang, Ying
                   ` (11 subsequent siblings)
  12 siblings, 4 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel
From: Huang Ying <ying.huang@intel.com>
In this patch, the size of the swap cluster is changed to that of the
THP (Transparent Huge Page) on the x86_64 architecture (512 pages).  This
is for the THP swap support on x86_64, where one swap cluster will be
used to hold the contents of each THP swapped out.  Some information
about the swapped out THP (such as the compound map count) will be
recorded in the swap_cluster_info data structure.
For other architectures which want THP swap support,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.
In effect, this will double the swap cluster size (from 256 to 512) on
x86_64, which may make it harder to find a free cluster when the swap
space becomes fragmented.  So in theory, this may reduce contiguous swap
space allocation and sequential writes.  The performance test in 0day
shows no regressions caused by this.
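As a small worked example of the cluster size arithmetic, the sketch
below assumes the usual x86_64 values of 4k base pages and 2M huge pages.
The TOY_* macros and helper names are hypothetical stand-ins for
PAGE_SIZE, HPAGE_SIZE and SWAPFILE_CLUSTER, so that this compiles in
userspace.

```c
#include <assert.h>

#define TOY_PAGE_SIZE		(1UL << 12)	/* assumed: 4k base page */
#define TOY_HPAGE_SIZE		(1UL << 21)	/* assumed: 2M huge page */
#define TOY_SWAPFILE_CLUSTER	(TOY_HPAGE_SIZE / TOY_PAGE_SIZE)

/* Index of the cluster that contains swap offset @offset. */
static unsigned long toy_cluster_index(unsigned long offset)
{
	return offset / TOY_SWAPFILE_CLUSTER;
}

/* First swap offset that belongs to cluster @idx. */
static unsigned long toy_cluster_first_offset(unsigned long idx)
{
	assert(TOY_SWAPFILE_CLUSTER == 512);
	return idx * TOY_SWAPFILE_CLUSTER;
}
```

With these values one cluster covers exactly one THP, so offsets 0..511
fall into cluster 0 and offset 512 starts cluster 1.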
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 arch/x86/Kconfig |  1 +
 mm/Kconfig       | 13 +++++++++++++
 mm/swapfile.c    |  4 ++++
 3 files changed, 18 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c39728..421d862 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -164,6 +164,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..2da8128 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -503,6 +503,19 @@ config FRONTSWAP
 
 	  If unsure, say Y to enable frontswap.
 
+config ARCH_USES_THP_SWAP_CLUSTER
+	bool
+	default n
+
+config THP_SWAP_CLUSTER
+	bool
+	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
+	default y
+	help
+	  Use one swap cluster to hold the contents of the THP
+	  (Transparent Huge Page) swapped out.  The size of the swap
+	  cluster will be same as that of THP.
+
 config CMA
 	bool "Contiguous Memory Allocator"
 	depends on HAVE_MEMBLOCK && MMU
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8f1b97d..4b78402 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-- 
2.8.1
* [PATCH -v3 02/10] mm, memcg: Add swap_cgroup_iter iterator
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Tejun Heo,
	cgroups
From: Huang Ying <ying.huang@intel.com>
The swap cgroup uses a kind of discontinuous array to record the
information for the swap entries.  lookup_swap_cgroup() provides a good
encapsulation to access one element of the discontinuous array.  To make
it easier to access multiple elements of the discontinuous array, an
iterator for the swap cgroup named swap_cgroup_iter is added in this
patch.
This will be used for transparent huge page (THP) swap support, where
the swap_cgroup for multiple swap entries will be changed together.
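The idea of iterating over such a discontinuous array can be sketched in
userspace as follows.  This is not the kernel code: locking is left out,
the "page" holds only 4 elements so that boundaries are easy to hit, the
boundary test uses the element index instead of the pointer value, and
all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PER_PAGE 4	/* tiny "page" so boundaries are easy to hit */

struct toy_ctrl {
	unsigned short **pages;	/* each page is a separate allocation */
	size_t nr_pages;
	unsigned long lookups;	/* counts full lookups, for illustration */
};

struct toy_iter {
	struct toy_ctrl *ctrl;
	unsigned short *slot;	/* current element */
	size_t index;		/* global index of the current element */
};

/* Allocate nr_pages zeroed pages; returns 0 on success, -1 on failure. */
static int toy_ctrl_init(struct toy_ctrl *ctrl, size_t nr_pages)
{
	size_t i;

	ctrl->lookups = 0;
	ctrl->nr_pages = nr_pages;
	ctrl->pages = calloc(nr_pages, sizeof(*ctrl->pages));
	if (!ctrl->pages)
		return -1;
	for (i = 0; i < nr_pages; i++) {
		ctrl->pages[i] = calloc(TOY_PER_PAGE, sizeof(**ctrl->pages));
		if (!ctrl->pages[i]) {
			while (i--)
				free(ctrl->pages[i]);
			free(ctrl->pages);
			return -1;
		}
	}
	return 0;
}

static void toy_ctrl_destroy(struct toy_ctrl *ctrl)
{
	size_t i;

	for (i = 0; i < ctrl->nr_pages; i++)
		free(ctrl->pages[i]);
	free(ctrl->pages);
}

/* Full lookup: find the page first, then the element inside it. */
static unsigned short *toy_lookup(struct toy_ctrl *ctrl, size_t index)
{
	assert(index / TOY_PER_PAGE < ctrl->nr_pages);
	ctrl->lookups++;
	return ctrl->pages[index / TOY_PER_PAGE] + index % TOY_PER_PAGE;
}

static void toy_iter_init(struct toy_iter *it, struct toy_ctrl *ctrl,
			  size_t index)
{
	it->ctrl = ctrl;
	it->index = index;
	it->slot = toy_lookup(ctrl, index);
}

/* Cheap pointer bump inside a page, full lookup only when crossing. */
static void toy_iter_advance(struct toy_iter *it)
{
	it->index++;
	if (it->index % TOY_PER_PAGE == 0)
		it->slot = toy_lookup(it->ctrl, it->index);
	else
		it->slot++;
}
```

Walking six elements starting at index 2 needs only two full lookups in
this model: one in toy_iter_init() and one when the iterator crosses from
index 3 to index 4.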
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_cgroup.c | 63 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 16 deletions(-)
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 310ac0b..4ae3e7b 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -18,6 +18,13 @@ struct swap_cgroup {
 };
 #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
 
+struct swap_cgroup_iter {
+	struct swap_cgroup_ctrl *ctrl;
+	struct swap_cgroup *sc;
+	swp_entry_t entry;
+	unsigned long flags;
+};
+
 /*
  * SwapCgroup implements "lookup" and "exchange" operations.
  * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
@@ -75,6 +82,35 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 	return sc + offset % SC_PER_PAGE;
 }
 
+static void swap_cgroup_iter_init(struct swap_cgroup_iter *iter,
+				  swp_entry_t ent)
+{
+	iter->entry = ent;
+	iter->sc = lookup_swap_cgroup(ent, &iter->ctrl);
+	spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
+}
+
+static void swap_cgroup_iter_exit(struct swap_cgroup_iter *iter)
+{
+	spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
+}
+
+/*
+ * swap_cgroup is stored in a kind of discontinuous array.  That is,
+ * they are continuous in one page, but not across page boundary.  And
+ * there is one lock for each page.
+ */
+static void swap_cgroup_iter_advance(struct swap_cgroup_iter *iter)
+{
+	iter->sc++;
+	iter->entry.val++;
+	if (!(((unsigned long)iter->sc) & ~PAGE_MASK)) {
+		spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
+		iter->sc = lookup_swap_cgroup(iter->entry, &iter->ctrl);
+		spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
+	}
+}
+
 /**
  * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
  * @ent: swap entry to be cmpxchged
@@ -87,20 +123,18 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new)
 {
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
-	unsigned long flags;
+	struct swap_cgroup_iter iter;
 	unsigned short retval;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	swap_cgroup_iter_init(&iter, ent);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	retval = sc->id;
+	retval = iter.sc->id;
 	if (retval == old)
-		sc->id = new;
+		iter.sc->id = new;
 	else
 		retval = 0;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+
+	swap_cgroup_iter_exit(&iter);
 	return retval;
 }
 
@@ -114,18 +148,15 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
  */
 unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 {
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
+	struct swap_cgroup_iter iter;
 	unsigned short old;
-	unsigned long flags;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	swap_cgroup_iter_init(&iter, ent);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	old = sc->id;
-	sc->id = id;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+	old = iter.sc->id;
+	iter.sc->id = id;
 
+	swap_cgroup_iter_exit(&iter);
 	return old;
 }
 
-- 
2.8.1
* [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 02/10] mm, memcg: Add swap_cgroup_iter iterator Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-08  5:46   ` Anshuman Khandual
  2016-09-08  8:28   ` Anshuman Khandual
  2016-09-07 16:46 ` [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Tejun Heo,
	cgroups
From: Huang Ying <ying.huang@intel.com>
This patch makes it possible to charge or uncharge a set of contiguous
swap entries in the swap cgroup.  The number of swap entries is
specified via an added parameter.
This will be used for the THP (Transparent Huge Page) swap support,
where a swap cluster backing a THP may be allocated and freed as a
whole.  So a set of contiguous swap entries (512 on x86_64) backing one
THP needs to be charged or uncharged together.  This will batch the
cgroup operations for the THP swap too.
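The batching can be pictured with a toy counter that charges several
entries in one all-or-nothing step.  This is only a sketch: toy_counter
and its helpers are hypothetical and much simpler than the kernel's
page_counter, and the sketch assumes usage never exceeds limit.

```c
#include <assert.h>

struct toy_counter {
	unsigned long usage;
	unsigned long limit;	/* invariant assumed: usage <= limit */
};

/*
 * Charge @nr entries in one step.  Either all of them are charged
 * (returns 0) or none of them are (returns -1).
 */
static int toy_try_charge(struct toy_counter *c, unsigned long nr)
{
	if (nr > c->limit - c->usage)
		return -1;
	c->usage += nr;
	return 0;
}

/* Drop the charge for @nr entries in one step. */
static void toy_uncharge(struct toy_counter *c, unsigned long nr)
{
	assert(nr <= c->usage);
	c->usage -= nr;
}
```

Charging 512 entries for one THP is then a single call instead of 512
separate calls, and a failed charge leaves the counter unchanged.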
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h        | 12 ++++++----
 include/linux/swap_cgroup.h |  6 +++--
 mm/memcontrol.c             | 55 +++++++++++++++++++++++++--------------------
 mm/shmem.c                  |  2 +-
 mm/swap_cgroup.c            | 17 ++++++++++----
 mm/swap_state.c             |  2 +-
 mm/swapfile.c               |  2 +-
 7 files changed, 59 insertions(+), 37 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed41bec..75aad24 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -550,8 +550,10 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 
 #ifdef CONFIG_MEMCG_SWAP
 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
-extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
-extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+				      unsigned int nr_entries);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry,
+				     unsigned int nr_entries);
 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
 extern bool mem_cgroup_swap_full(struct page *page);
 #else
@@ -560,12 +562,14 @@ static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 }
 
 static inline int mem_cgroup_try_charge_swap(struct page *page,
-					     swp_entry_t entry)
+					     swp_entry_t entry,
+					     unsigned int nr_entries)
 {
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
+					    unsigned int nr_entries)
 {
 }
 
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306b..b2b8ec7 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
+extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,7 +16,8 @@ extern void swap_cgroup_swapoff(int type);
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb796f..9662fcf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2370,10 +2370,9 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+				       int nr_entries)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_entries);
 }
 
 /**
@@ -2399,8 +2398,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -5417,7 +5416,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		 * let's not wait for it.  The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
@@ -5825,9 +5824,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(swap_memcg, true);
+	mem_cgroup_swap_statistics(swap_memcg, 1);
 
 	page->mem_cgroup = NULL;
 
@@ -5854,16 +5853,19 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 		css_put(&memcg->css);
 }
 
-/*
- * mem_cgroup_try_charge_swap - try charging a swap entry
+/**
+ * mem_cgroup_try_charge_swap - try charging a set of swap entries
  * @page: page being added to swap
- * @entry: swap entry to charge
+ * @entry: the first swap entry to charge
+ * @nr_entries: the number of swap entries to charge
  *
- * Try to charge @entry to the memcg that @page belongs to.
+ * Try to charge @nr_entries swap entries starting from @entry to the
+ * memcg that @page belongs to.
  *
  * Returns 0 on success, -ENOMEM on failure.
  */
-int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
+int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+			       unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	struct page_counter *counter;
@@ -5881,25 +5883,29 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	memcg = mem_cgroup_id_get_online(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
-	    !page_counter_try_charge(&memcg->swap, 1, &counter)) {
+	    !page_counter_try_charge(&memcg->swap, nr_entries, &counter)) {
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
 
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	if (nr_entries > 1)
+		mem_cgroup_id_get_many(memcg, nr_entries - 1);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_entries);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
+	mem_cgroup_swap_statistics(memcg, nr_entries);
 
 	return 0;
 }
 
 /**
- * mem_cgroup_uncharge_swap - uncharge a swap entry
- * @entry: swap entry to uncharge
+ * mem_cgroup_uncharge_swap - uncharge a set of swap entries
+ * @entry: the first swap entry to uncharge
+ * @nr_entries: the number of swap entries to uncharge
  *
- * Drop the swap charge associated with @entry.
+ * Drop the swap charge associated with @nr_entries swap entries
+ * starting from @entry.
  */
-void mem_cgroup_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	unsigned short id;
@@ -5907,17 +5913,18 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	if (!do_swap_account)
 		return;
 
-	id = swap_cgroup_record(entry, 0);
+	id = swap_cgroup_record(entry, 0, nr_entries);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
-				page_counter_uncharge(&memcg->swap, 1);
+				page_counter_uncharge(&memcg->swap, nr_entries);
 			else
-				page_counter_uncharge(&memcg->memsw, 1);
+				page_counter_uncharge(&memcg->memsw,
+						      nr_entries);
 		}
-		mem_cgroup_swap_statistics(memcg, false);
+		mem_cgroup_swap_statistics(memcg, -nr_entries);
 		mem_cgroup_id_put(memcg);
 	}
 	rcu_read_unlock();
diff --git a/mm/shmem.c b/mm/shmem.c
index ac35ebd..baeb2f9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1248,7 +1248,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (!swap.val)
 		goto redirty;
 
-	if (mem_cgroup_try_charge_swap(page, swap))
+	if (mem_cgroup_try_charge_swap(page, swap, 1))
 		goto free_swap;
 
 	/*
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 4ae3e7b..4d3484f 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -139,14 +139,16 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for this swp_entry.
- * @ent: swap entry to be recorded into
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * @ent: the first swap entry to be recorded into
  * @id: mem_cgroup to be recorded
+ * @nr_ents: number of swap entries to be recorded
  *
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	struct swap_cgroup_iter iter;
 	unsigned short old;
@@ -154,7 +156,14 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 	swap_cgroup_iter_init(&iter, ent);
 
 	old = iter.sc->id;
-	iter.sc->id = id;
+	for (;;) {
+		VM_BUG_ON(iter.sc->id != old);
+		iter.sc->id = id;
+		nr_ents--;
+		if (!nr_ents)
+			break;
+		swap_cgroup_iter_advance(&iter);
+	}
 
 	swap_cgroup_iter_exit(&iter);
 	return old;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8679c99..c335251 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -172,7 +172,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 	if (!entry.val)
 		return 0;
 
-	if (mem_cgroup_try_charge_swap(page, entry)) {
+	if (mem_cgroup_try_charge_swap(page, entry, 1)) {
 		swapcache_free(entry);
 		return 0;
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b78402..17f25e2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -806,7 +806,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, 1);
 		dec_cluster_info_page(p, p->cluster_info, offset);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
-- 
2.8.1
* [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (2 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-08  5:49   ` Anshuman Khandual
  2016-09-08  8:30   ` Anshuman Khandual
  2016-09-07 16:46 ` [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
From: Huang Ying <ying.huang@intel.com>
The swap cluster allocation/free functions are added based on the
existing swap cluster management mechanism for SSD.  These functions
don't work for rotating hard disks because the existing swap cluster
management mechanism doesn't work for them.  Hard disk support may
be added if someone really needs it.  But that needn't be included in
this patchset.
This will be used for the THP (Transparent Huge Page) swap support,
where one swap cluster will hold the contents of each THP swapped out.
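A minimal userspace sketch of whole-cluster allocation from a free list
is shown below.  It assumes a fixed number of clusters and a simple FIFO
ring in place of the kernel's cluster list, and it signals failure with
-1, whereas swap_alloc_huge_cluster() in this patch returns 0 when no
free cluster is available.  All toy_* names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOY_CLUSTER	512	/* entries per cluster, as on x86_64 */
#define TOY_NR_CLUSTERS	4

struct toy_swap {
	unsigned char map[TOY_NR_CLUSTERS * TOY_CLUSTER]; /* per-entry usage */
	int free_list[TOY_NR_CLUSTERS];	/* FIFO ring of free cluster indices */
	int head;			/* position of the first free cluster */
	int count;			/* number of free clusters */
};

/* Start with every cluster free and every entry unused. */
static void toy_swap_init(struct toy_swap *s)
{
	int i;

	memset(s->map, 0, sizeof(s->map));
	for (i = 0; i < TOY_NR_CLUSTERS; i++)
		s->free_list[i] = i;
	s->head = 0;
	s->count = TOY_NR_CLUSTERS;
}

/* Take the first free cluster and mark all of its entries as used. */
static int toy_alloc_huge(struct toy_swap *s)
{
	int idx;

	if (s->count == 0)
		return -1;
	idx = s->free_list[s->head];
	s->head = (s->head + 1) % TOY_NR_CLUSTERS;
	s->count--;
	memset(s->map + idx * TOY_CLUSTER, 1, TOY_CLUSTER);
	return idx;
}

/* Clear all entries of the cluster and put it at the tail of the list. */
static void toy_free_huge(struct toy_swap *s, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_CLUSTERS);
	assert(s->count < TOY_NR_CLUSTERS);
	memset(s->map + idx * TOY_CLUSTER, 0, TOY_CLUSTER);
	s->free_list[(s->head + s->count) % TOY_NR_CLUSTERS] = idx;
	s->count++;
}
```

Because a freed cluster goes to the tail of the list, it is reused only
after the clusters that were already free, which matches the add-to-tail
behaviour of __free_cluster() in the diff.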
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swapfile.c | 203 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 146 insertions(+), 57 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 17f25e2..0132e8c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -326,6 +326,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -345,8 +353,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 	}
@@ -363,6 +370,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -374,11 +409,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -402,21 +434,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -497,6 +516,69 @@ new_cluster:
 	*scan_base = tmp;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline unsigned int huge_cluster_nr_entries(bool huge)
+{
+	return huge ? SWAPFILE_CLUSTER : 1;
+}
+#else
+#define huge_cluster_nr_entries(huge)	1
+#endif
+
+static void __swap_entry_alloc(struct swap_info_struct *si,
+			       unsigned long offset, bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void __swap_entry_free(struct swap_info_struct *si, unsigned long offset,
+			      bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long) = NULL;
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify =
+			si->bdev->bd_disk->fops->swap_slot_free_notify;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -591,18 +673,7 @@ checks:
 	if (si->swap_map[offset])
 		goto scan;
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	__swap_entry_alloc(si, offset, false);
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	si->cluster_next = offset + 1;
@@ -649,6 +720,46 @@ no_page:
 	return 0;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static void swap_free_huge_cluster(struct swap_info_struct *si,
+				   unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	__swap_entry_free(si, offset, true);
+}
+
+static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+	idx = cluster_list_first(&si->free_clusters);
+	alloc_cluster(si, idx);
+	ci = si->cluster_info + idx;
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+	offset = idx * SWAPFILE_CLUSTER;
+	__swap_entry_alloc(si, offset, true);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	return offset;
+}
+#else
+static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
+{
+	return 0;
+}
+#endif
+
 swp_entry_t get_swap_page(void)
 {
 	struct swap_info_struct *si, *next;
@@ -808,29 +919,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	if (!usage) {
 		mem_cgroup_uncharge_swap(entry, 1);
 		dec_cluster_info_page(p, p->cluster_info, offset);
-		if (offset < p->lowest_bit)
-			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
-			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
-		}
-		atomic_long_inc(&nr_swap_pages);
-		p->inuse_pages--;
-		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
-		}
+		__swap_entry_free(p, offset, false);
 	}
 
 	return usage;
-- 
2.8.1
* [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page()
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (3 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-08 11:13   ` Kirill A. Shutemov
  2016-09-07 16:46 ` [PATCH -v3 06/10] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
From: Huang Ying <ying.huang@intel.com>
A variation of get_swap_page(), get_huge_swap_page(), is added to
allocate a swap cluster (512 swap slots) based on the swap cluster
allocation function.  A fairly simple algorithm is used: only the first
swap device in the priority list is tried for the swap cluster
allocation.  If that attempt fails, the function fails, and the caller
falls back to allocating a single swap slot instead.  This works well
enough for normal cases.
This will be used for the THP (Transparent Huge Page) swap support,
where get_huge_swap_page() will be used to allocate one swap cluster for
each THP swapped out.
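The allocate-or-fall-back behavior described above can be sketched in
user space as follows.  This is only an illustrative model, not the
kernel code: the toy_* names, the 4-cluster device, and the flat "used"
array are all assumptions made for the example.  As in the patch, an
offset of 0 means "allocation failed".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_CLUSTER_SLOTS 512	/* assumption: cluster size used on x86_64 */
#define TOY_NR_CLUSTERS   4	/* hypothetical, tiny swap device */
#define TOY_NR_SLOTS      (TOY_NR_CLUSTERS * TOY_CLUSTER_SLOTS)

/* Hypothetical stand-in for one swap device: one "used" byte per slot. */
struct toy_swap_dev {
	unsigned char used[TOY_NR_SLOTS];
};

static void toy_init(struct toy_swap_dev *dev)
{
	assert(dev);
	memset(dev->used, 0, sizeof(dev->used));
	dev->used[0] = 1;	/* offset 0 is reserved, so 0 can mean failure */
}

static bool toy_cluster_is_free(const struct toy_swap_dev *dev,
				unsigned long idx)
{
	unsigned long i, base = idx * TOY_CLUSTER_SLOTS;

	for (i = 0; i < TOY_CLUSTER_SLOTS; i++)
		if (dev->used[base + i])
			return false;
	return true;
}

/* Take one completely free cluster; return its first offset, or 0. */
static unsigned long toy_alloc_huge(struct toy_swap_dev *dev)
{
	unsigned long idx;

	for (idx = 0; idx < TOY_NR_CLUSTERS; idx++) {
		if (!toy_cluster_is_free(dev, idx))
			continue;
		memset(dev->used + idx * TOY_CLUSTER_SLOTS, 1,
		       TOY_CLUSTER_SLOTS);
		return idx * TOY_CLUSTER_SLOTS;
	}
	return 0;
}

/* Take the first free single slot; return its offset, or 0. */
static unsigned long toy_alloc_single(struct toy_swap_dev *dev)
{
	unsigned long off;

	for (off = 1; off < TOY_NR_SLOTS; off++) {
		if (!dev->used[off]) {
			dev->used[off] = 1;
			return off;
		}
	}
	return 0;
}

/* Try a whole cluster first, then fall back to a single slot. */
static unsigned long toy_alloc_for_thp(struct toy_swap_dev *dev,
				       unsigned int *nr_slots)
{
	unsigned long off = toy_alloc_huge(dev);

	if (off) {
		*nr_slots = TOY_CLUSTER_SLOTS;
		return off;
	}
	off = toy_alloc_single(dev);
	*nr_slots = off ? 1 : 0;
	return off;
}
```

In this model the caller checks nr_slots to learn whether it received a
whole cluster or only a single slot, which loosely corresponds to the
caller having to split the THP first in the single-slot case.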
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h | 24 +++++++++++++++++++++++-
 mm/swapfile.c        | 18 ++++++++++++------
 2 files changed, 35 insertions(+), 7 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 75aad24..bc0a84d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -399,7 +399,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t __get_swap_page(bool huge);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -419,6 +419,23 @@ extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
+static inline swp_entry_t get_swap_page(void)
+{
+	return __get_swap_page(false);
+}
+
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return __get_swap_page(true);
+}
+#else
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+#endif
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
@@ -525,6 +542,11 @@ static inline swp_entry_t get_swap_page(void)
 	return entry;
 }
 
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+
 #endif /* CONFIG_SWAP */
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0132e8c..3d2bd1f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -760,14 +760,15 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 }
 #endif
 
-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(bool huge)
 {
 	struct swap_info_struct *si, *next;
 	pgoff_t offset;
+	int nr_pages = huge_cluster_nr_entries(huge);
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	if (atomic_long_read(&nr_swap_pages) < nr_pages)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+	atomic_long_sub(nr_pages, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -795,10 +796,15 @@ start_over:
 		}
 
 		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		if (likely(nr_pages == 1))
+			offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		else
+			offset = swap_alloc_huge_cluster(si);
 		spin_unlock(&si->lock);
 		if (offset)
 			return swp_entry(si->type, offset);
+		else if (unlikely(nr_pages != 1))
+			goto fail_alloc;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 		       si->type);
 		spin_lock(&swap_avail_lock);
@@ -818,8 +824,8 @@ nextsi:
 	}
 
 	spin_unlock(&swap_avail_lock);
-
-	atomic_long_inc(&nr_swap_pages);
+fail_alloc:
+	atomic_long_add(nr_pages, &nr_swap_pages);
 noswap:
 	return (swp_entry_t) {0};
 }
-- 
2.8.1
* [PATCH -v3 06/10] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (4 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
From: Huang Ying <ying.huang@intel.com>
__swapcache_free() is added to support clearing the SWAP_HAS_CACHE flag
for a huge page.  For now, this frees the specified swap cluster,
because the function is currently called only in the error path to free
a swap cluster that was just allocated.  So the corresponding
swap_map[i] == SWAP_HAS_CACHE, that is, the swap count is 0.  This makes
the implementation simpler than that of the ordinary swap entry.
This will be used for delaying splitting the THP (Transparent Huge Page)
during swapping out, where for one THP to swap out, we will allocate a
swap cluster, add the THP into the swap cache, then split the THP.  If
anything fails after allocating the swap cluster and before splitting
the THP successfully, swapcache_free_trans_huge() will be used to free
the swap space allocated.
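To illustrate why the error path is simple, here is a small user-space
sketch of the invariant: every slot of a just-allocated cluster holds
exactly the has-cache flag and no swap count.  The toy_* names are made
up for this example, and returning false stands in for the VM_BUG_ON()
check in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_CLUSTER_SLOTS 512	/* assumption: cluster size used on x86_64 */
#define TOY_HAS_CACHE     0x40	/* same bit value as SWAP_HAS_CACHE */

/*
 * Free one just-allocated cluster whose swap_map bytes start at map[0].
 * Returns false and leaves the map untouched if any slot is not in the
 * expected "cache flag only, count 0" state.
 */
static bool toy_free_huge_cluster(unsigned char *map)
{
	size_t i;

	/* First pass: verify the invariant for every slot. */
	for (i = 0; i < TOY_CLUSTER_SLOTS; i++)
		if (map[i] != TOY_HAS_CACHE)
			return false;
	/* Second pass: clear the flag, which leaves every slot at 0. */
	for (i = 0; i < TOY_CLUSTER_SLOTS; i++)
		map[i] &= (unsigned char)~TOY_HAS_CACHE;
	return true;
}
```

Because no slot can have a nonzero swap count at this point, the whole
cluster can be released in one step instead of slot by slot.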
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  9 +++++++--
 mm/swapfile.c        | 32 ++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 4 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0a84d..7be7599 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -406,7 +406,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t, bool);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -478,7 +478,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void __swapcache_free(swp_entry_t swp, bool huge)
 {
 }
 
@@ -549,6 +549,11 @@ static inline swp_entry_t get_huge_swap_page(void)
 
 #endif /* CONFIG_SWAP */
 
+static inline void swapcache_free(swp_entry_t entry)
+{
+	__swapcache_free(entry, false);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3d2bd1f..26b75fa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -732,6 +732,26 @@ static void swap_free_huge_cluster(struct swap_info_struct *si,
 	__swap_entry_free(si, offset, true);
 }
 
+/*
+ * Caller should hold si->lock.
+ */
+static void swapcache_free_trans_huge(struct swap_info_struct *si,
+				      swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	unsigned char *map;
+	unsigned int i;
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] &= ~SWAP_HAS_CACHE;
+	}
+	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+	swap_free_huge_cluster(si, idx);
+}
+
 static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 {
 	unsigned long idx;
@@ -758,6 +778,11 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 {
 	return 0;
 }
+
+static inline void swapcache_free_trans_huge(struct swap_info_struct *si,
+					     swp_entry_t entry)
+{
+}
 #endif
 
 swp_entry_t __get_swap_page(bool huge)
@@ -949,13 +974,16 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry, bool huge)
 {
 	struct swap_info_struct *p;
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
+		if (unlikely(huge))
+			swapcache_free_trans_huge(p, entry);
+		else
+			swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
-- 
2.8.1
* [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (5 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 06/10] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-08  9:00   ` Anshuman Khandual
  2016-09-07 16:46 ` [PATCH -v3 08/10] mm, THP: Add can_split_huge_page() Huang, Ying
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov
From: Huang Ying <ying.huang@intel.com>
With this patch, a THP (Transparent Huge Page) can be added/deleted
to/from the swap cache as a set of sub-pages (512 on x86_64).
This will be used for the THP (Transparent Huge Page) swap support,
where one THP may be added/deleted to/from the swap cache.  This also
batches the swap cache operations to reduce the number of lock
acquire/release operations for the THP swap.
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/page-flags.h |  2 +-
 mm/swap_state.c            | 57 +++++++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 19 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..f5bcbea 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c335251..db2299f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -43,6 +43,7 @@ struct address_space swapper_spaces[MAX_SWAPFILES] = {
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
 	unsigned long add_total;
@@ -80,25 +81,32 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-	int error;
+	int error, i, nr = hpage_nr_pages(page);
 	struct address_space *address_space;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	get_page(page);
+	page_ref_add(page, nr);
 	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
 
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-					entry.val, page);
+	for (i = 0; i < nr; i++) {
+		struct page *cur_page = page + i;
+		unsigned long index = entry.val + i;
+
+		set_page_private(cur_page, index);
+		error = radix_tree_insert(&address_space->page_tree,
+					  index, cur_page);
+		if (unlikely(error))
+			break;
+	}
 	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_node_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
+		address_space->nrpages += nr;
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		ADD_CACHE_INFO(add_total, nr);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
 
@@ -109,9 +117,16 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 		 * So add_to_swap_cache() doesn't returns -EEXIST.
 		 */
 		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
 		ClearPageSwapCache(page);
-		put_page(page);
+		set_page_private(page + i, 0UL);
+		while (i--) {
+			struct page *cur_page = page + i;
+			unsigned long index = entry.val + i;
+
+			set_page_private(cur_page, 0UL);
+			radix_tree_delete(&address_space->page_tree, index);
+		}
+		page_ref_sub(page, nr);
 	}
 
 	return error;
@@ -122,7 +137,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
 	int error;
 
-	error = radix_tree_maybe_preload(gfp_mask);
+	error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
@@ -138,6 +153,7 @@ void __delete_from_swap_cache(struct page *page)
 {
 	swp_entry_t entry;
 	struct address_space *address_space;
+	int i, nr = hpage_nr_pages(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -145,12 +161,17 @@ void __delete_from_swap_cache(struct page *page)
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, page_private(page));
-	set_page_private(page, 0);
 	ClearPageSwapCache(page);
-	address_space->nrpages--;
-	__dec_node_page_state(page, NR_FILE_PAGES);
-	INC_CACHE_INFO(del_total);
+	for (i = 0; i < nr; i++) {
+		struct page *cur_page = page + i;
+
+		radix_tree_delete(&address_space->page_tree,
+				  page_private(cur_page));
+		set_page_private(cur_page, 0);
+	}
+	address_space->nrpages -= nr;
+	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+	ADD_CACHE_INFO(del_total, nr);
 }
 
 /**
@@ -227,8 +248,8 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry);
-	put_page(page);
+	__swapcache_free(entry, PageTransHuge(page));
+	page_ref_sub(page, hpage_nr_pages(page));
 }
 
 /* 
-- 
2.8.1
* [PATCH -v3 08/10] mm, THP: Add can_split_huge_page()
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (6 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-08 11:17   ` Kirill A. Shutemov
  2016-09-07 16:46 ` [PATCH -v3 09/10] mm, THP, swap: Support to split THP in swap cache Huang, Ying
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Ebru Akagunduz
From: Huang Ying <ying.huang@intel.com>
Separate the check of whether we can split the huge page from
split_huge_page_to_list() into its own function.  This makes it possible
to do the check before actually splitting the THP (Transparent Huge
Page).
This will be used for delaying splitting the THP during swapping out,
where for a THP, we will allocate a swap cluster, add the THP into the
swap cache, then split the THP.  To avoid unnecessary operations for an
un-splittable THP, we will do the check first.
There is no functional change in this patch.
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/huge_mm.h |  6 ++++++
 mm/huge_memory.c        | 13 ++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d..a0073e7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
+bool can_split_huge_page(struct page *page);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -176,6 +177,11 @@ static inline void prep_transhuge_page(struct page *page) {}
 
 #define thp_get_unmapped_area	NULL
 
+static inline bool
+can_split_huge_page(struct page *page)
+{
+	return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc0d37e..3be5abe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2016,6 +2016,17 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	return ret;
 }
 
+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page)
+{
+	int extra_pins = 0;
+
+	/* Additional pins from radix tree */
+	if (!PageAnon(page))
+		extra_pins = HPAGE_PMD_NR;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2086,7 +2097,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+	if (!can_split_huge_page(head)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
-- 
2.8.1
* [PATCH -v3 09/10] mm, THP, swap: Support to split THP in swap cache
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (7 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 08/10] mm, THP: Add can_split_huge_page() Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-07 16:46 ` [PATCH -v3 10/10] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying, Andrea Arcangeli, Kirill A . Shutemov,
	Ebru Akagunduz
From: Huang Ying <ying.huang@intel.com>
This patch enhances split_huge_page_to_list() to work properly for a
THP (Transparent Huge Page) in the swap cache during swapping out.
This is used for delaying splitting the THP during swapping out, where
for a THP to be swapped out, we will allocate a swap cluster, add the
THP into the swap cache, then split the THP.  The page lock will be held
during this process.  So in code paths other than swapping out, if the
THP needs to be split, PageSwapCache(THP) will always be false.
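The reference-count arithmetic behind the splittability check in
can_split_huge_page() can be sketched as a pure function.  This is a
user-space model under stated assumptions: the toy_* names are made up,
the counts are passed in as plain integers, and the "- 1" is taken to be
the pin held by the caller of the split.

```c
#include <stdbool.h>

#define TOY_HPAGE_PMD_NR 512	/* sub-pages per THP on x86_64, 4K pages */

/*
 * A THP looks splittable when every reference is explained by a
 * mapping, by the radix tree pins, or by the caller's own pin.
 */
static bool toy_can_split(int total_mapcount, int page_count,
			  bool anon, bool in_swap_cache)
{
	int extra_pins;

	if (anon)
		extra_pins = in_swap_cache ? TOY_HPAGE_PMD_NR : 0;
	else
		extra_pins = TOY_HPAGE_PMD_NR;
	return total_mapcount == page_count - extra_pins - 1;
}
```

For example, an anonymous THP with one mapping and one caller pin has a
page count of 2 and passes the check; once it is in the swap cache, the
same page needs a count of 2 + 512 to pass, because each sub-page entry
holds a pin.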
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/huge_memory.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3be5abe..3bb4976 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1834,7 +1834,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -1845,6 +1845,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -1907,7 +1908,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -2019,10 +2024,12 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page)
 {
-	int extra_pins = 0;
+	int extra_pins;
 
 	/* Additional pins from radix tree */
-	if (!PageAnon(page))
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
 		extra_pins = HPAGE_PMD_NR;
 	return total_mapcount(page) == page_count(page) - extra_pins - 1;
 }
@@ -2075,7 +2082,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
+		extra_pins = PageSwapCache(head) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
-- 
2.8.1
* [PATCH -v3 10/10] mm, THP, swap: Delay splitting THP during swap out
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (8 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 09/10] mm, THP, swap: Support to split THP in swap cache Huang, Ying
@ 2016-09-07 16:46 ` Huang, Ying
  2016-09-09  5:43 ` [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Minchan Kim
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-07 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Huang Ying
From: Huang Ying <ying.huang@intel.com>
In this patch, splitting the huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the
THP (Transparent Huge Page) and adding the THP into the swap cache.
This will reduce lock acquiring/releasing for the locks used for the
swap cache management.
This is the first step for the THP swap support.  The plan is to delay
splitting the THP step by step and finally avoid splitting the THP.
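The way add_to_swap() reacts to the three kinds of return value from
add_to_swap_trans_huge() in this patch (1, 0, or a negative errno) can
be summarized by the following sketch.  The enum and the function name
are hypothetical and exist only to make the convention explicit.

```c
/* Hypothetical labels for what the caller does next. */
enum toy_next_step {
	TOY_DONE,		/* THP is in the swap cache and split */
	TOY_USE_SINGLE_SLOT,	/* continue with the ordinary per-page path */
	TOY_GIVE_UP		/* report failure for this page */
};

/*
 * thp_ret models the return value of the THP-specific helper:
 *   1   the whole THP was handled,
 *   0   nothing was done, fall back to the ordinary path,
 *  <0   an error occurred (for example -EBUSY or -EOVERFLOW).
 */
static enum toy_next_step toy_dispatch(int thp_ret)
{
	if (thp_ret == 1)
		return TOY_DONE;
	if (thp_ret == 0)
		return TOY_USE_SINGLE_SLOT;
	return TOY_GIVE_UP;
}
```

Keeping 0 as the "nothing happened" value means the ordinary path is
taken unchanged whenever the THP path is unavailable, including when the
feature is compiled out and the stub simply returns 0.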
The advantages of the THP swap support include:
- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help to improve the THP swap performance.
- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which is usually 4k random
  IO.  This will help to improve the THP swap performance too.
- It will help with memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M of contiguous pages will be
  freed up after the THP is swapped out.
With the patchset, the swap out throughput improved by 12.1% (from
1.12GB/s to 1.25GB/s) in the vm-scalability swap-w-seq test case with 16
processes.  The test was done on a Xeon E5 v3 system.  A RAM-simulated
PMEM (persistent memory) device was used as the swap device.  To test
sequential swapping out, the test case uses 16 processes that
sequentially allocate and write to anonymous pages until the RAM and
part of the swap device are used up.
The detailed comparison result is as follows:
base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
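As a quick cross-check of the %change column, the relative change can be
recomputed from the base and patched values in the table.  The helper
name below is made up for the example.

```c
/* Relative change from base to patched, in percent. */
static double toy_pct_change(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```

With the vmstat.swap.so row, toy_pct_change(1118821, 1254241) comes out
at about 12.1, which matches the headline throughput figure; the
elapsed_time row (308.79 to 284.53) gives about -7.9.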
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_state.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 3 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index db2299f..63b637a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/huge_mm.h>
 
 #include <asm/pgtable.h>
 
@@ -174,12 +175,53 @@ void __delete_from_swap_cache(struct page *page)
 	ADD_CACHE_INFO(del_total, nr);
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+int add_to_swap_trans_huge(struct page *page, struct list_head *list)
+{
+	swp_entry_t entry;
+	int ret = 0;
+
+	/* skip a THP that cannot be split; a split may be needed at swap in */
+	if (!can_split_huge_page(page))
+		return -EBUSY;
+	/* fall back to splitting first if there is no PMD mapping */
+	if (!compound_mapcount(page))
+		return 0;
+	entry = get_huge_swap_page();
+	if (!entry.val)
+		return 0;
+	if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) {
+		__swapcache_free(entry, true);
+		return -EOVERFLOW;
+	}
+	ret = add_to_swap_cache(page, entry,
+				__GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN);
+	/* -ENOMEM radix-tree allocation failure */
+	if (ret) {
+		__swapcache_free(entry, true);
+		return 0;
+	}
+	ret = split_huge_page_to_list(page, list);
+	if (ret) {
+		delete_from_swap_cache(page);
+		return -EBUSY;
+	}
+	return 1;
+}
+#else
+static inline int add_to_swap_trans_huge(struct page *page,
+					 struct list_head *list)
+{
+	return 0;
+}
+#endif
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate swap space for the page and add the page to the
- * swap cache.  Caller needs to hold the page lock. 
+ * swap cache.  Caller needs to hold the page lock.
  */
 int add_to_swap(struct page *page, struct list_head *list)
 {
@@ -189,6 +231,18 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
+	if (unlikely(PageTransHuge(page))) {
+		err = add_to_swap_trans_huge(page, list);
+		switch (err) {
+		case 1:
+			return 1;
+		case 0:
+			/* on 0, fall back to splitting the THP first */
+			break;
+		default:
+			return 0;
+		}
+	}
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
@@ -306,7 +360,7 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page) {
+	if (page && likely(!PageTransCompound(page))) {
 		INC_CACHE_INFO(find_success);
 		if (TestClearPageReadahead(page))
 			atomic_inc(&swapin_readahead_hits);
@@ -332,8 +386,13 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * that would confuse statistics.
 		 */
 		found_page = find_get_page(swapper_space, entry.val);
-		if (found_page)
+		if (found_page) {
+			if (unlikely(PageTransCompound(found_page))) {
+				put_page(found_page);
+				found_page = NULL;
+			}
 			break;
+		}
 
 		/*
 		 * Get a new page to read into from swap.
-- 
2.8.1
^ permalink raw reply related	[flat|nested] 60+ messages in thread
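For readers following the thread, the return convention of
add_to_swap_trans_huge() in the patch above can be modelled in user
space roughly as below.  This is only a sketch under stated
assumptions: the ex_* names are hypothetical labels invented for the
illustration and are not kernel identifiers, and no real swap or THP
operation is performed.

```c
#include <assert.h>
#include <errno.h>

/* HYPOTHETICAL labels for where the huge path stops; not kernel identifiers. */
enum ex_stop {
	EX_NONE,		/* every step succeeded */
	EX_NO_PMD_MAP,		/* compound_mapcount() == 0 */
	EX_NO_CLUSTER,		/* get_huge_swap_page() returned no entry */
	EX_CGROUP_LIMIT,	/* mem_cgroup_try_charge_swap() failed */
	EX_CACHE_NOMEM,		/* add_to_swap_cache() failed */
	EX_SPLIT_BUSY,		/* split_huge_page_to_list() failed */
};

/* Mirrors the return values of add_to_swap_trans_huge() in the patch. */
static inline int ex_trans_huge_result(enum ex_stop stop)
{
	switch (stop) {
	case EX_NONE:
		return 1;		/* THP is in the swap cache and split */
	case EX_NO_PMD_MAP:
	case EX_NO_CLUSTER:
	case EX_CACHE_NOMEM:
		return 0;		/* caller should split first, then retry */
	case EX_CGROUP_LIMIT:
		return -EOVERFLOW;
	case EX_SPLIT_BUSY:
		return -EBUSY;
	}
	assert(0 && "unknown stop reason");
	return -EINVAL;
}

/* What add_to_swap() does with that value (hypothetical enum). */
enum ex_action { EX_GIVE_UP = 0, EX_DONE = 1, EX_FALL_BACK = 2 };

/* Mirrors the switch statement added to add_to_swap(). */
static inline enum ex_action ex_dispatch(int err)
{
	if (err == 1)
		return EX_DONE;		/* case 1: return 1 */
	if (err == 0)
		return EX_FALL_BACK;	/* case 0: break, use the old path */
	return EX_GIVE_UP;		/* default: return 0 */
}
```

In this model only a return value of 0 reaches the existing
single-page path; any negative error makes add_to_swap() give up on
the page.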
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
@ 2016-09-08  5:45   ` Anshuman Khandual
  2016-09-08 18:07     ` Huang, Ying
  2016-09-19 17:09     ` Johannes Weiner
  2016-09-08  8:21   ` Anshuman Khandual
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  5:45 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In this patch, the size of the swap cluster is changed to that of the
> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
> the THP swap support on x86_64.  Where one swap cluster will be used to
> hold the contents of each THP swapped out.  And some information of the
> swapped out THP (such as compound map count) will be recorded in the
> swap_cluster_info data structure.
> 
> For other architectures which want THP swap support, THP_SWAP_CLUSTER
> need to be selected in the Kconfig file for the architecture.
> 
> In effect, this will enlarge swap cluster size by 2 times on x86_64.
> Which may make it harder to find a free cluster when the swap space
> becomes fragmented.  So that, this may reduce the continuous swap space
> allocation and sequential write in theory.  The performance test in 0day
> shows no regressions caused by this.
This patch needs to be split into two separate ones:
(1) Add THP_SWAP_CLUSTER config option
(2) Enable CONFIG_THP_SWAP_CLUSTER for X86_64
The first patch should explain the proposal, and the second patch
should carry the x86_64 arch-specific details, regressions etc., as
already explained in the commit message.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries
  2016-09-07 16:46 ` [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
@ 2016-09-08  5:46   ` Anshuman Khandual
  2016-09-08  8:28   ` Anshuman Khandual
  1 sibling, 0 replies; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  5:46 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Tejun Heo,
	cgroups
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> This patch make it possible to charge or uncharge a set of continuous
> swap entries in the swap cgroup.  The number of swap entries is
> specified via an added parameter.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where a swap cluster backing a THP may be allocated and freed as a
> whole.  So a set of continuous swap entries (512 on x86_64) backing one
Please use HPAGE_SIZE / PAGE_SIZE instead of a hard-coded number like 512.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions
  2016-09-07 16:46 ` [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
@ 2016-09-08  5:49   ` Anshuman Khandual
  2016-09-08  8:30   ` Anshuman Khandual
  1 sibling, 0 replies; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  5:49 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Shaohua Li, Minchan Kim, Rik van Riel
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> The swap cluster allocation/free functions are added based on the
> existing swap cluster management mechanism for SSD.  These functions
> don't work for the rotating hard disks because the existing swap cluster
> management mechanism doesn't work for them.  The hard disks support may
> be added if someone really need it.  But that needn't be included in
> this patchset.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where one swap cluster will hold the contents of each THP swapped out.
Which tree is this series based on? This patch does not apply to the
mainline kernel today.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
  2016-09-08  5:45   ` Anshuman Khandual
@ 2016-09-08  8:21   ` Anshuman Khandual
  2016-09-08 11:03   ` Kirill A. Shutemov
  2016-09-08 11:07   ` Kirill A. Shutemov
  3 siblings, 0 replies; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  8:21 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In this patch, the size of the swap cluster is changed to that of the
> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
> the THP swap support on x86_64.  Where one swap cluster will be used to
> hold the contents of each THP swapped out.  And some information of the
> swapped out THP (such as compound map count) will be recorded in the
> swap_cluster_info data structure.
> 
> For other architectures which want THP swap support, THP_SWAP_CLUSTER
> need to be selected in the Kconfig file for the architecture.
> 
> In effect, this will enlarge swap cluster size by 2 times on x86_64.
> Which may make it harder to find a free cluster when the swap space
> becomes fragmented.  So that, this may reduce the continuous swap space
> allocation and sequential write in theory.  The performance test in 0day
> shows no regressions caused by this.
This patch needs to be split into two separate ones:
(1) Add THP_SWAP_CLUSTER config option
(2) Enable CONFIG_THP_SWAP_CLUSTER for X86_64
The first patch should explain the proposal, and the second patch
should carry the x86_64 arch-specific details, regressions etc., as
already explained in the commit message.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries
  2016-09-07 16:46 ` [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
  2016-09-08  5:46   ` Anshuman Khandual
@ 2016-09-08  8:28   ` Anshuman Khandual
  2016-09-08 18:15     ` Huang, Ying
  1 sibling, 1 reply; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  8:28 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Tejun Heo,
	cgroups
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> This patch make it possible to charge or uncharge a set of continuous
> swap entries in the swap cgroup.  The number of swap entries is
> specified via an added parameter.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where a swap cluster backing a THP may be allocated and freed as a
> whole.  So a set of continuous swap entries (512 on x86_64) backing one
Please use HPAGE_SIZE / PAGE_SIZE instead of a hard-coded number like 512.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions
  2016-09-07 16:46 ` [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
  2016-09-08  5:49   ` Anshuman Khandual
@ 2016-09-08  8:30   ` Anshuman Khandual
  2016-09-08 18:14     ` Huang, Ying
  1 sibling, 1 reply; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  8:30 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Shaohua Li, Minchan Kim, Rik van Riel
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> The swap cluster allocation/free functions are added based on the
> existing swap cluster management mechanism for SSD.  These functions
> don't work for the rotating hard disks because the existing swap cluster
> management mechanism doesn't work for them.  The hard disks support may
> be added if someone really need it.  But that needn't be included in
> this patchset.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where one swap cluster will hold the contents of each THP swapped out.
Which tree is this series based on? This patch does not apply to the
mainline kernel.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache
  2016-09-07 16:46 ` [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
@ 2016-09-08  9:00   ` Anshuman Khandual
  2016-09-08 18:10     ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Anshuman Khandual @ 2016-09-08  9:00 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov
On 09/07/2016 10:16 PM, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> With this patch, a THP (Transparent Huge Page) can be added/deleted
> to/from the swap cache as a set of sub-pages (512 on x86_64).
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where one THP may be added/delted to/from the swap cache.  This will
> batch the swap cache operations to reduce the lock acquire/release times
> for the THP swap too.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  include/linux/page-flags.h |  2 +-
>  mm/swap_state.c            | 57 +++++++++++++++++++++++++++++++---------------
>  2 files changed, 40 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74e4dda..f5bcbea 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
>  #endif
>  
>  #ifdef CONFIG_SWAP
> -PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
> +PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
What is the reason for this change? The commit message does not seem
to explain it.
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
  2016-09-08  5:45   ` Anshuman Khandual
  2016-09-08  8:21   ` Anshuman Khandual
@ 2016-09-08 11:03   ` Kirill A. Shutemov
  2016-09-08 17:39     ` Huang, Ying
  2016-09-08 11:07   ` Kirill A. Shutemov
  3 siblings, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2016-09-08 11:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel
On Wed, Sep 07, 2016 at 09:46:00AM -0700, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In this patch, the size of the swap cluster is changed to that of the
> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
> the THP swap support on x86_64.  Where one swap cluster will be used to
> hold the contents of each THP swapped out.  And some information of the
> swapped out THP (such as compound map count) will be recorded in the
> swap_cluster_info data structure.
> 
> For other architectures which want THP swap support, THP_SWAP_CLUSTER
> need to be selected in the Kconfig file for the architecture.
> 
> In effect, this will enlarge swap cluster size by 2 times on x86_64.
> Which may make it harder to find a free cluster when the swap space
> becomes fragmented.  So that, this may reduce the continuous swap space
> allocation and sequential write in theory.  The performance test in 0day
> shows no regressions caused by this.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  arch/x86/Kconfig |  1 +
>  mm/Kconfig       | 13 +++++++++++++
>  mm/swapfile.c    |  4 ++++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4c39728..421d862 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -164,6 +164,7 @@ config X86
>  	select HAVE_STACK_VALIDATION		if X86_64
>  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> +	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
>  
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/mm/Kconfig b/mm/Kconfig
> index be0ee11..2da8128 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -503,6 +503,19 @@ config FRONTSWAP
>  
>  	  If unsure, say Y to enable frontswap.
>  
> +config ARCH_USES_THP_SWAP_CLUSTER
> +	bool
> +	default n
> +
> +config THP_SWAP_CLUSTER
> +	bool
> +	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
> +	default y
> +	help
> +	  Use one swap cluster to hold the contents of the THP
> +	  (Transparent Huge Page) swapped out.  The size of the swap
> +	  cluster will be same as that of THP.
> +
Why do we need to ask the user about it? I don't think most users are
qualified to make this decision.
>  config CMA
>  	bool "Contiguous Memory Allocator"
>  	depends on HAVE_MEMBLOCK && MMU
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 8f1b97d..4b78402 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  	}
>  }
>  
> +#ifdef CONFIG_THP_SWAP_CLUSTER
Just
#if defined(CONFIG_ARCH_USES_THP_SWAP_CLUSTER) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
would be enough from my POV.
> +#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
> +#else
>  #define SWAPFILE_CLUSTER	256
> +#endif
>  #define LATENCY_LIMIT		256
>  
>  static inline void cluster_set_flag(struct swap_cluster_info *info,
> -- 
> 2.8.1
> 
-- 
 Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
                     ` (2 preceding siblings ...)
  2016-09-08 11:03   ` Kirill A. Shutemov
@ 2016-09-08 11:07   ` Kirill A. Shutemov
  2016-09-08 17:23     ` Huang, Ying
  3 siblings, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2016-09-08 11:07 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel
On Wed, Sep 07, 2016 at 09:46:00AM -0700, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In this patch, the size of the swap cluster is changed to that of the
> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
> the THP swap support on x86_64.  Where one swap cluster will be used to
> hold the contents of each THP swapped out.  And some information of the
> swapped out THP (such as compound map count) will be recorded in the
> swap_cluster_info data structure.
> 
> For other architectures which want THP swap support, THP_SWAP_CLUSTER
> need to be selected in the Kconfig file for the architecture.
> 
> In effect, this will enlarge swap cluster size by 2 times on x86_64.
> Which may make it harder to find a free cluster when the swap space
> becomes fragmented.  So that, this may reduce the continuous swap space
> allocation and sequential write in theory.  The performance test in 0day
> shows no regressions caused by this.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  arch/x86/Kconfig |  1 +
>  mm/Kconfig       | 13 +++++++++++++
>  mm/swapfile.c    |  4 ++++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4c39728..421d862 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -164,6 +164,7 @@ config X86
>  	select HAVE_STACK_VALIDATION		if X86_64
>  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
> +	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
>  
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/mm/Kconfig b/mm/Kconfig
> index be0ee11..2da8128 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -503,6 +503,19 @@ config FRONTSWAP
>  
>  	  If unsure, say Y to enable frontswap.
>  
> +config ARCH_USES_THP_SWAP_CLUSTER
> +	bool
> +	default n
> +
> +config THP_SWAP_CLUSTER
> +	bool
> +	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
> +	default y
> +	help
> +	  Use one swap cluster to hold the contents of the THP
> +	  (Transparent Huge Page) swapped out.  The size of the swap
> +	  cluster will be same as that of THP.
> +
>  config CMA
>  	bool "Contiguous Memory Allocator"
>  	depends on HAVE_MEMBLOCK && MMU
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 8f1b97d..4b78402 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  	}
>  }
>  
> +#ifdef CONFIG_THP_SWAP_CLUSTER
> +#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
#define SWAPFILE_CLUSTER HPAGE_PMD_NR
Note, HPAGE_SIZE is not necessarily HPAGE_PMD_SIZE. I can imagine an arch
with multiple huge page sizes where HPAGE_SIZE differs from what is used
for THP.
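To illustrate the arithmetic, here is a minimal user-space sketch (not
kernel code).  The EX_* constants are assumptions written out by hand:
the x86_64 values with 4 KiB base pages (page shift 12, PMD shift 21),
plus a purely HYPOTHETICAL arch whose default huge page shift is 30.

```c
#include <assert.h>

/* x86_64 values with 4 KiB base pages, spelled out as plain constants. */
#define EX_PAGE_SHIFT		12
#define EX_PMD_SHIFT		21	/* one PMD entry maps 2 MiB */
#define EX_HPAGE_PMD_NR		(1UL << (EX_PMD_SHIFT - EX_PAGE_SHIFT))

/* HYPOTHETICAL arch whose default huge page is 1 GiB; not a real config. */
#define EX_HYPO_HPAGE_SHIFT	30

/* The THP size on x86_64 is 512 base pages, i.e. one swap cluster. */
static_assert(EX_HPAGE_PMD_NR == 512, "x86_64 THP spans 512 base pages");

/* Number of base pages covered by one huge page of the given shift. */
static inline unsigned long ex_pages_per_huge(unsigned int huge_shift)
{
	return 1UL << (huge_shift - EX_PAGE_SHIFT);
}
```

With the PMD shift the result is 512, which is what the THP code needs.
With the hypothetical 1 GiB shift, an HPAGE_SIZE / PAGE_SIZE style
computation would give 262144 entries per cluster, which is why
HPAGE_PMD_NR is the safer definition.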
> +#else
>  #define SWAPFILE_CLUSTER	256
> +#endif
>  #define LATENCY_LIMIT		256
>  
>  static inline void cluster_set_flag(struct swap_cluster_info *info,
> -- 
> 2.8.1
> 
-- 
 Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page()
  2016-09-07 16:46 ` [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
@ 2016-09-08 11:13   ` Kirill A. Shutemov
  2016-09-08 17:22     ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2016-09-08 11:13 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
On Wed, Sep 07, 2016 at 09:46:04AM -0700, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> A variation of get_swap_page(), get_huge_swap_page(), is added to
> allocate a swap cluster (512 swap slots) based on the swap cluster
> allocation function.  A fair simple algorithm is used, that is, only the
> first swap device in priority list will be tried to allocate the swap
> cluster.  The function will fail if the trying is not successful, and
> the caller will fallback to allocate a single swap slot instead.  This
> works good enough for normal cases.
For normal cases, yes. But the limitation is not obvious to users, and
a performance difference after a small change in configuration could be
puzzling.
At least this must be documented somewhere.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where get_huge_swap_page() will be used to allocate one swap cluster for
> each THP swapped out.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  include/linux/swap.h | 24 +++++++++++++++++++++++-
>  mm/swapfile.c        | 18 ++++++++++++------
>  2 files changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 75aad24..bc0a84d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -399,7 +399,7 @@ static inline long get_nr_swap_pages(void)
>  }
>  
>  extern void si_swapinfo(struct sysinfo *);
> -extern swp_entry_t get_swap_page(void);
> +extern swp_entry_t __get_swap_page(bool huge);
>  extern swp_entry_t get_swap_page_of_type(int);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>  extern void swap_shmem_alloc(swp_entry_t);
> @@ -419,6 +419,23 @@ extern bool reuse_swap_page(struct page *, int *);
>  extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  
> +static inline swp_entry_t get_swap_page(void)
> +{
> +	return __get_swap_page(false);
> +}
> +
> +#ifdef CONFIG_THP_SWAP_CLUSTER
> +static inline swp_entry_t get_huge_swap_page(void)
> +{
> +	return __get_swap_page(true);
> +}
> +#else
> +static inline swp_entry_t get_huge_swap_page(void)
> +{
> +	return (swp_entry_t) {0};
> +}
> +#endif
> +
>  #else /* CONFIG_SWAP */
>  
>  #define swap_address_space(entry)		(NULL)
> @@ -525,6 +542,11 @@ static inline swp_entry_t get_swap_page(void)
>  	return entry;
>  }
>  
> +static inline swp_entry_t get_huge_swap_page(void)
> +{
> +	return (swp_entry_t) {0};
> +}
> +
>  #endif /* CONFIG_SWAP */
>  
>  #ifdef CONFIG_MEMCG
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0132e8c..3d2bd1f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -760,14 +760,15 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
>  }
>  #endif
>  
> -swp_entry_t get_swap_page(void)
> +swp_entry_t __get_swap_page(bool huge)
>  {
>  	struct swap_info_struct *si, *next;
>  	pgoff_t offset;
> +	int nr_pages = huge_cluster_nr_entries(huge);
>  
> -	if (atomic_long_read(&nr_swap_pages) <= 0)
> +	if (atomic_long_read(&nr_swap_pages) < nr_pages)
>  		goto noswap;
> -	atomic_long_dec(&nr_swap_pages);
> +	atomic_long_sub(nr_pages, &nr_swap_pages);
>  
>  	spin_lock(&swap_avail_lock);
>  
> @@ -795,10 +796,15 @@ start_over:
>  		}
>  
>  		/* This is called for allocating swap entry for cache */
> -		offset = scan_swap_map(si, SWAP_HAS_CACHE);
> +		if (likely(nr_pages == 1))
> +			offset = scan_swap_map(si, SWAP_HAS_CACHE);
> +		else
> +			offset = swap_alloc_huge_cluster(si);
>  		spin_unlock(&si->lock);
>  		if (offset)
>  			return swp_entry(si->type, offset);
> +		else if (unlikely(nr_pages != 1))
> +			goto fail_alloc;
>  		pr_debug("scan_swap_map of si %d failed to find offset\n",
>  		       si->type);
>  		spin_lock(&swap_avail_lock);
> @@ -818,8 +824,8 @@ nextsi:
>  	}
>  
>  	spin_unlock(&swap_avail_lock);
> -
> -	atomic_long_inc(&nr_swap_pages);
> +fail_alloc:
> +	atomic_long_add(nr_pages, &nr_swap_pages);
>  noswap:
>  	return (swp_entry_t) {0};
>  }
> -- 
> 2.8.1
> 
-- 
 Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 08/10] mm, THP: Add can_split_huge_page()
  2016-09-07 16:46 ` [PATCH -v3 08/10] mm, THP: Add can_split_huge_page() Huang, Ying
@ 2016-09-08 11:17   ` Kirill A. Shutemov
  2016-09-08 17:02     ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2016-09-08 11:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Ebru Akagunduz
On Wed, Sep 07, 2016 at 09:46:07AM -0700, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> Separates checking whether we can split the huge page from
> split_huge_page_to_list() into a function.  This will help to check that
> before splitting the THP (Transparent Huge Page) really.
> 
> This will be used for delaying splitting THP during swapping out.  Where
> for a THP, we will allocate a swap cluster, add the THP into the swap
> cache, then split the THP.  To avoid the unnecessary operations for the
> un-splittable THP, we will check that firstly.
> 
> There is no functionality change in this patch.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  include/linux/huge_mm.h |  6 ++++++
>  mm/huge_memory.c        | 13 ++++++++++++-
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9b9f65d..a0073e7 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
>  extern void prep_transhuge_page(struct page *page);
>  extern void free_transhuge_page(struct page *page);
>  
> +bool can_split_huge_page(struct page *page);
>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>  static inline int split_huge_page(struct page *page)
>  {
> @@ -176,6 +177,11 @@ static inline void prep_transhuge_page(struct page *page) {}
>  
>  #define thp_get_unmapped_area	NULL
>  
> +static inline bool
> +can_split_huge_page(struct page *page)
> +{
BUILD_BUG() should be appropriate here.
> +	return false;
> +}
>  static inline int
>  split_huge_page_to_list(struct page *page, struct list_head *list)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc0d37e..3be5abe 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2016,6 +2016,17 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
>  	return ret;
>  }
>  
> +/* Racy check whether the huge page can be split */
> +bool can_split_huge_page(struct page *page)
> +{
> +	int extra_pins = 0;
> +
> +	/* Additional pins from radix tree */
> +	if (!PageAnon(page))
> +		extra_pins = HPAGE_PMD_NR;
> +	return total_mapcount(page) == page_count(page) - extra_pins - 1;
> +}
> +
>  /*
>   * This function splits huge page into normal pages. @page can point to any
>   * subpage of huge page to split. Split doesn't change the position of @page.
> @@ -2086,7 +2097,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	 * Racy check if we can split the page, before freeze_page() will
>  	 * split PMDs
>  	 */
> -	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
> +	if (!can_split_huge_page(head)) {
>  		ret = -EBUSY;
>  		goto out_unlock;
>  	}
> -- 
> 2.8.1
> 
-- 
 Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 08/10] mm, THP: Add can_split_huge_page()
  2016-09-08 11:17   ` Kirill A. Shutemov
@ 2016-09-08 17:02     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 17:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Ebru Akagunduz
Hi, Kirill,
Thanks for your comments!
"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Wed, Sep 07, 2016 at 09:46:07AM -0700, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> Separates checking whether we can split the huge page from
>> split_huge_page_to_list() into a function.  This will help to check that
>> before splitting the THP (Transparent Huge Page) really.
>> 
>> This will be used for delaying splitting THP during swapping out.  Where
>> for a THP, we will allocate a swap cluster, add the THP into the swap
>> cache, then split the THP.  To avoid the unnecessary operations for the
>> un-splittable THP, we will check that firstly.
>> 
>> There is no functionality change in this patch.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/huge_mm.h |  6 ++++++
>>  mm/huge_memory.c        | 13 ++++++++++++-
>>  2 files changed, 18 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 9b9f65d..a0073e7 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
>>  extern void prep_transhuge_page(struct page *page);
>>  extern void free_transhuge_page(struct page *page);
>>  
>> +bool can_split_huge_page(struct page *page);
>>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>>  static inline int split_huge_page(struct page *page)
>>  {
>> @@ -176,6 +177,11 @@ static inline void prep_transhuge_page(struct page *page) {}
>>  
>>  #define thp_get_unmapped_area	NULL
>>  
>> +static inline bool
>> +can_split_huge_page(struct page *page)
>> +{
>
> BUILD_BUG() should be appropriate here.
Yes.  Will add it.
>> +	return false;
>> +}
Best Regards,
Huang, Ying
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page()
  2016-09-08 11:13   ` Kirill A. Shutemov
@ 2016-09-08 17:22     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 17:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel
"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Wed, Sep 07, 2016 at 09:46:04AM -0700, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> A variation of get_swap_page(), get_huge_swap_page(), is added to
>> allocate a swap cluster (512 swap slots) based on the swap cluster
>> allocation function.  A fair simple algorithm is used, that is, only the
>> first swap device in priority list will be tried to allocate the swap
>> cluster.  The function will fail if the trying is not successful, and
>> the caller will fallback to allocate a single swap slot instead.  This
>> works good enough for normal cases.
>
> For normal cases, yes. But the limitation is not obvious for users and
> performance difference after small change in configuration could be
> puzzling.
If the number of free swap clusters differs significantly among
multiple swap devices, some THPs may be split earlier than necessary
because we fail to allocate swap clusters for them.  For example, this
could be caused by a big size difference among the swap devices.
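A rough user-space model of that behaviour, as a sketch only: struct
ex_swap_dev and ex_get_huge_swap() are hypothetical stand-ins for
swap_info_struct and the huge path of __get_swap_page(), the locking
is omitted, and devs[0] is assumed to be the first device in the
priority list.

```c
#include <assert.h>
#include <stddef.h>

#define EX_CLUSTER_NR	512	/* swap slots per cluster, as on x86_64 */

/* HYPOTHETICAL stand-in for a swap device; not struct swap_info_struct. */
struct ex_swap_dev {
	long free_clusters;	/* number of completely free clusters */
};

/*
 * Model of the huge path: reserve EX_CLUSTER_NR slots from the global
 * counter, try only devs[0], and give the reservation back on failure.
 * Returns the index of the device used, or -1 if no cluster was found.
 */
static inline int ex_get_huge_swap(struct ex_swap_dev *devs, size_t ndevs,
				   long *nr_swap_pages)
{
	assert(devs && nr_swap_pages);

	if (ndevs == 0 || *nr_swap_pages < EX_CLUSTER_NR)
		return -1;
	*nr_swap_pages -= EX_CLUSTER_NR;

	if (devs[0].free_clusters > 0) {
		devs[0].free_clusters--;
		return 0;
	}

	/* No retry on devs[1..]: the caller falls back to splitting the THP. */
	*nr_swap_pages += EX_CLUSTER_NR;
	return -1;
}
```

With one free cluster on the first device and 100 on the second, the
first call succeeds and the second fails, even though the second
device still has plenty of free clusters.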
> At least this must be documented somewhere.
I can add the above explanation to the patch description.  Are there
any other places you would suggest?
Best Regards,
Huang, Ying
[snip]
^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-08 11:07   ` Kirill A. Shutemov
@ 2016-09-08 17:23     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 17:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel
"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Wed, Sep 07, 2016 at 09:46:00AM -0700, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In this patch, the size of the swap cluster is changed to that of the
>> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
>> the THP swap support on x86_64.  Where one swap cluster will be used to
>> hold the contents of each THP swapped out.  And some information of the
>> swapped out THP (such as compound map count) will be recorded in the
>> swap_cluster_info data structure.
>> 
>> For other architectures which want THP swap support, THP_SWAP_CLUSTER
>> need to be selected in the Kconfig file for the architecture.
>> 
>> In effect, this will enlarge swap cluster size by 2 times on x86_64.
>> Which may make it harder to find a free cluster when the swap space
>> becomes fragmented.  So that, this may reduce the continuous swap space
>> allocation and sequential write in theory.  The performance test in 0day
>> shows no regressions caused by this.
>> 
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  arch/x86/Kconfig |  1 +
>>  mm/Kconfig       | 13 +++++++++++++
>>  mm/swapfile.c    |  4 ++++
>>  3 files changed, 18 insertions(+)
>> 
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 4c39728..421d862 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -164,6 +164,7 @@ config X86
>>  	select HAVE_STACK_VALIDATION		if X86_64
>>  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>>  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
>> +	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
>>  
>>  config INSTRUCTION_DECODER
>>  	def_bool y
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index be0ee11..2da8128 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -503,6 +503,19 @@ config FRONTSWAP
>>  
>>  	  If unsure, say Y to enable frontswap.
>>  
>> +config ARCH_USES_THP_SWAP_CLUSTER
>> +	bool
>> +	default n
>> +
>> +config THP_SWAP_CLUSTER
>> +	bool
>> +	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
>> +	default y
>> +	help
>> +	  Use one swap cluster to hold the contents of the THP
>> +	  (Transparent Huge Page) swapped out.  The size of the swap
>> +	  cluster will be same as that of THP.
>> +
>>  config CMA
>>  	bool "Contiguous Memory Allocator"
>>  	depends on HAVE_MEMBLOCK && MMU
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 8f1b97d..4b78402 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>>  	}
>>  }
>>  
>> +#ifdef CONFIG_THP_SWAP_CLUSTER
>> +#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
>
> #define SWAPFILE_CLUSTER HPAGE_PMD_NR
Yes.  Will change it.
> Note, HPAGE_SIZE is not necessarily HPAGE_PMD_SIZE. I can imagine an arch
> with multiple huge page sizes where HPAGE_SIZE differs from what is used
> for THP.
Thanks for pointing that out!
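To make the difference concrete, here is a small user-space sketch.
The shift values are assumptions: 12 and 21 match x86_64 with 4k base
pages, while the 16M default huge page size is a hypothetical
architecture chosen only for illustration.  All TOY_ names are made up.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT	12	/* 4 KiB base pages */
#define TOY_PMD_SHIFT	21	/* 2 MiB PMD mapping; THP size on x86_64 */
#define TOY_HPAGE_SHIFT	24	/* hypothetical 16 MiB default hugetlb page */

/* Number of base pages covered by one huge page of the given shift. */
static unsigned long toy_base_pages_per(unsigned int huge_shift)
{
	assert(huge_shift >= TOY_PAGE_SHIFT);
	return 1UL << (huge_shift - TOY_PAGE_SHIFT);
}
```

On such an architecture, toy_base_pages_per(TOY_PMD_SHIFT) gives 512
slots per cluster, while a definition derived from the default huge
page size would give 4096, which no longer matches one THP.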
Best Regards,
Huang, Ying
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-08 11:03   ` Kirill A. Shutemov
@ 2016-09-08 17:39     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 17:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel
"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Wed, Sep 07, 2016 at 09:46:00AM -0700, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In this patch, the size of the swap cluster is changed to that of the
>> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
>> the THP swap support on x86_64.  Where one swap cluster will be used to
>> hold the contents of each THP swapped out.  And some information of the
>> swapped out THP (such as compound map count) will be recorded in the
>> swap_cluster_info data structure.
>> 
>> For other architectures which want THP swap support, THP_SWAP_CLUSTER
>> need to be selected in the Kconfig file for the architecture.
>> 
>> In effect, this will enlarge swap cluster size by 2 times on x86_64.
>> Which may make it harder to find a free cluster when the swap space
>> becomes fragmented.  So that, this may reduce the continuous swap space
>> allocation and sequential write in theory.  The performance test in 0day
>> shows no regressions caused by this.
>> 
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  arch/x86/Kconfig |  1 +
>>  mm/Kconfig       | 13 +++++++++++++
>>  mm/swapfile.c    |  4 ++++
>>  3 files changed, 18 insertions(+)
>> 
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 4c39728..421d862 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -164,6 +164,7 @@ config X86
>>  	select HAVE_STACK_VALIDATION		if X86_64
>>  	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
>>  	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
>> +	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
>>  
>>  config INSTRUCTION_DECODER
>>  	def_bool y
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index be0ee11..2da8128 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -503,6 +503,19 @@ config FRONTSWAP
>>  
>>  	  If unsure, say Y to enable frontswap.
>>  
>> +config ARCH_USES_THP_SWAP_CLUSTER
>> +	bool
>> +	default n
>> +
>> +config THP_SWAP_CLUSTER
>> +	bool
>> +	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
>> +	default y
>> +	help
>> +	  Use one swap cluster to hold the contents of the THP
>> +	  (Transparent Huge Page) swapped out.  The size of the swap
>> +	  cluster will be same as that of THP.
>> +
>
> Why do we need to ask user about it? I don't think most users qualified to
> make this decision.
Users do not need to choose this.  If the dependencies are satisfied, it
will be turned on.  I added the help text not for users, but so that
developers know what it is for.
>>  config CMA
>>  	bool "Contiguous Memory Allocator"
>>  	depends on HAVE_MEMBLOCK && MMU
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 8f1b97d..4b78402 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>>  	}
>>  }
>>  
>> +#ifdef CONFIG_THP_SWAP_CLUSTER
>
> Just
>
> #if defined(CONFIG_ARCH_USES_THP_SWAP_CLUSTER) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>
> would be enough from my POV.
That works.  I added a new configuration option just to save some typing
and make it a little easier to read.  If other people also think it is
not necessary to add a new configuration option for that, I will change
it in this way.
Best Regards,
Huang, Ying
>> +#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
>> +#else
>>  #define SWAPFILE_CLUSTER	256
>> +#endif
>>  #define LATENCY_LIMIT		256
>>  
>>  static inline void cluster_set_flag(struct swap_cluster_info *info,
>> -- 
>> 2.8.1
>> 
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-08  5:45   ` Anshuman Khandual
@ 2016-09-08 18:07     ` Huang, Ying
  2016-09-19 17:09     ` Johannes Weiner
  1 sibling, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 18:07 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel
Anshuman Khandual <khandual@linux.vnet.ibm.com> writes:
> On 09/07/2016 10:16 PM, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In this patch, the size of the swap cluster is changed to that of the
>> THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
>> the THP swap support on x86_64.  Where one swap cluster will be used to
>> hold the contents of each THP swapped out.  And some information of the
>> swapped out THP (such as compound map count) will be recorded in the
>> swap_cluster_info data structure.
>> 
>> For other architectures which want THP swap support, THP_SWAP_CLUSTER
>> need to be selected in the Kconfig file for the architecture.
>> 
>> In effect, this will enlarge swap cluster size by 2 times on x86_64.
>> Which may make it harder to find a free cluster when the swap space
>> becomes fragmented.  So that, this may reduce the continuous swap space
>> allocation and sequential write in theory.  The performance test in 0day
>> shows no regressions caused by this.
>
> This patch needs to be split into two separate ones
>
> (1) Add THP_SWAP_CLUSTER config option
> (2) Enable CONFIG_THP_SWAP_CLUSTER for X86_64
>
> The first patch should explain the proposal and the second patch
> should have x86_64 arch specific details, regressions etc as already
> been explained in the commit message.
The code change and the possible issues are not x86_64 specific, but
apply to all architectures where the config option is enabled.  So the
second patch would become a one-line Kconfig change with not much to say
in its patch description.  Does it deserve a separate patch?
Best Regards,
Huang, Ying
* Re: [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache
  2016-09-08  9:00   ` Anshuman Khandual
@ 2016-09-08 18:10     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 18:10 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov
Hi, Anshuman,
Thanks for comments!
Anshuman Khandual <khandual@linux.vnet.ibm.com> writes:
> On 09/07/2016 10:16 PM, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> With this patch, a THP (Transparent Huge Page) can be added/deleted
>> to/from the swap cache as a set of sub-pages (512 on x86_64).
>> 
>> This will be used for the THP (Transparent Huge Page) swap support.
>> Where one THP may be added/deleted to/from the swap cache.  This will
>> batch the swap cache operations to reduce the lock acquire/release times
>> for the THP swap too.
>> 
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/page-flags.h |  2 +-
>>  mm/swap_state.c            | 57 +++++++++++++++++++++++++++++++---------------
>>  2 files changed, 40 insertions(+), 19 deletions(-)
>> 
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 74e4dda..f5bcbea 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
>>  #endif
>>  
>>  #ifdef CONFIG_SWAP
>> -PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
>> +PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
>
> What is the reason for this change ? The commit message does not seem
> to explain.
Before this change, SetPageSwapCache() could not be called on a THP.
After the change, SetPageSwapCache() can be called on the head page of a
THP, but not on the tail pages.  The old policy was enough because we
never did that before this patch series.
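The two policies can be modelled in user space as below.  This is only
a sketch of which pages may have the flag set; the enum and function
names are hypothetical and the real page flag macros do more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page kinds; a real struct page encodes this differently. */
enum toy_page_kind { TOY_PAGE_SMALL, TOY_PAGE_HEAD, TOY_PAGE_TAIL };

/* Toy versions of the two page flag policies named in the diff. */
enum toy_flag_policy { TOY_PF_NO_COMPOUND, TOY_PF_NO_TAIL };

/* Returns true if setting the flag on a page of this kind is allowed. */
static bool toy_may_set_flag(enum toy_flag_policy policy,
			     enum toy_page_kind kind)
{
	assert(policy == TOY_PF_NO_COMPOUND || policy == TOY_PF_NO_TAIL);
	if (policy == TOY_PF_NO_COMPOUND)
		return kind == TOY_PAGE_SMALL;	/* no compound pages at all */
	return kind != TOY_PAGE_TAIL;		/* small and head pages only */
}
```

Under the toy NO_TAIL policy the head page of a THP is accepted, which
is what adding a whole THP to the swap cache needs.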
Best Regards,
Huang, Ying
* Re: [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions
  2016-09-08  8:30   ` Anshuman Khandual
@ 2016-09-08 18:14     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 18:14 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel
Anshuman Khandual <khandual@linux.vnet.ibm.com> writes:
> On 09/07/2016 10:16 PM, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> The swap cluster allocation/free functions are added based on the
>> existing swap cluster management mechanism for SSD.  These functions
>> don't work for the rotating hard disks because the existing swap cluster
>> management mechanism doesn't work for them.  The hard disks support may
>> be added if someone really need it.  But that needn't be included in
>> this patchset.
>> 
>> This will be used for the THP (Transparent Huge Page) swap support.
>> Where one swap cluster will hold the contents of each THP swapped out.
>
> Which tree this series is based against ? This patch does not apply
> on the mainline kernel.
This series is based on the 8/31 head of mmotm/master.  I stated that in
00/10, but I know it is hidden inside other text and not obvious at all.
Is there some way to make it more obvious?
Best Regards,
Huang, Ying
* Re: [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries
  2016-09-08  8:28   ` Anshuman Khandual
@ 2016-09-08 18:15     ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-08 18:15 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Vladimir Davydov, Johannes Weiner,
	Michal Hocko, Tejun Heo, cgroups
Anshuman Khandual <khandual@linux.vnet.ibm.com> writes:
> On 09/07/2016 10:16 PM, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> This patch make it possible to charge or uncharge a set of continuous
>> swap entries in the swap cgroup.  The number of swap entries is
>> specified via an added parameter.
>> 
>> This will be used for the THP (Transparent Huge Page) swap support.
>> Where a swap cluster backing a THP may be allocated and freed as a
>> whole.  So a set of continuous swap entries (512 on x86_64) backing one
>
> Please use HPAGE_SIZE / PAGE_SIZE instead of hard coded number like 512.
Sure.  Will change it.
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (9 preceding siblings ...)
  2016-09-07 16:46 ` [PATCH -v3 10/10] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
@ 2016-09-09  5:43 ` Minchan Kim
  2016-09-09 15:53   ` Tim Chen
  2016-09-09 20:35   ` Huang, Ying
  2016-09-19 17:33 ` Hugh Dickins
  2016-09-22 22:56 ` Shaohua Li
  12 siblings, 2 replies; 60+ messages in thread
From: Minchan Kim @ 2016-09-09  5:43 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
Hi Huang,
On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> This patchset is to optimize the performance of Transparent Huge Page
> (THP) swap.
> 
> Hi, Andrew, could you help me to check whether the overall design is
> reasonable?
> 
> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> swap part of the patchset?  Especially [01/10], [04/10], [05/10],
> [06/10], [07/10], [10/10].
> 
> Hi, Andrea and Kirill, could you help me to review the THP part of the
> patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
> 
> Hi, Johannes, Michal and Vladimir, I am not very confident about the
> memory cgroup part, especially [02/10] and [03/10].  Could you help me
> to review it?
> 
> And for all, Any comment is welcome!
> 
> 
> Recently, the performance of the storage devices improved so fast that
> we cannot saturate the disk bandwidth when do page swap out even on a
> high-end server machine.  Because the performance of the storage
> device improved faster than that of CPU.  And it seems that the trend
> will not change in the near future.  On the other hand, the THP
> becomes more and more popular because of increased memory size.  So it
> becomes necessary to optimize THP swap performance.
> 
> The advantages of the THP swap support include:
> 
> - Batch the swap operations for the THP to reduce lock
>   acquiring/releasing, including allocating/freeing the swap space,
>   adding/deleting to/from the swap cache, and writing/reading the swap
>   space, etc.  This will help improve the performance of the THP swap.
> 
> - The THP swap space read/write will be 2M sequential IO.  It is
>   particularly helpful for the swap read, which usually are 4k random
>   IO.  This will improve the performance of the THP swap too.
> 
> - It will help the memory fragmentation, especially when the THP is
>   heavily used by the applications.  The 2M continuous pages will be
>   free up after THP swapping out.
I have just read the patchset and still doubt that all the changes
should be coupled so tightly with THP. Many parts (e.g., the functions
you introduced or modified to make them THP specific) could
just take a page_list and the number of pages, and then handle them
without THP awareness.
For example, if nr_pages is larger than SWAPFILE_CLUSTER, we
can try to allocate a new cluster. With that, we could allocate new
clusters to meet the nr_pages requested, or bail out if we fail to
allocate and fall back to 0-order page swapout. With that, the swap
layer could support multiple order-0 pages by batch.
IMO, I really want to land Tim Chen's batching swapout work first.
With Tim Chen's work, I expect we can do a better refactoring
for batching swap before adding more confusion to the swap layer.
(I expect it would share several pieces of code with, or would be the
base for, batching allocation of swapcache, swapslot)
After that, we could enhance swap for big contiguous batching
like THP, and finally we might make it aware of THP specifics to
enhance it further.
One thing I remember you argued: you want to swap in 512 pages
all at once, unconditionally. It's really worth discussing whether
your design is going that way.
I doubt it's generally a good idea, because currently we try to
swap in the swapped-out pages of a THP with a conservative approach,
but your direction goes the opposite way.
[mm, thp: convert from optimistic swapin collapsing to conservative]
I think the general approach (i.e., less effective than an
implementation targeted at your own specific goal, but less hacky and
a better job for many cases) is to rely on and improve the swap readahead.
If most of the subpages of a THP are really in the working set, swap
readahead could work well.
Yeah, it's fairly vague feedback, so sorry if I missed something obvious.
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-09  5:43 ` [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Minchan Kim
@ 2016-09-09 15:53   ` Tim Chen
  2016-09-09 20:35   ` Huang, Ying
  1 sibling, 0 replies; 60+ messages in thread
From: Tim Chen @ 2016-09-09 15:53 UTC (permalink / raw)
  To: Minchan Kim, Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
On Fri, 2016-09-09 at 14:43 +0900, Minchan Kim wrote:
> Hi Huang,
> 
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> > 
> > From: Huang Ying <ying.huang@intel.com>
> > 
> > This patchset is to optimize the performance of Transparent Huge Page
> > (THP) swap.
> > 
> > Hi, Andrew, could you help me to check whether the overall design is
> > reasonable?
> > 
> > Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> > swap part of the patchset?  Especially [01/10], [04/10], [05/10],
> > [06/10], [07/10], [10/10].
> > 
> > Hi, Andrea and Kirill, could you help me to review the THP part of the
> > patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
> > 
> > Hi, Johannes, Michal and Vladimir, I am not very confident about the
> > memory cgroup part, especially [02/10] and [03/10].  Could you help me
> > to review it?
> > 
> > And for all, Any comment is welcome!
> > 
> > 
> > Recently, the performance of the storage devices improved so fast that
> > we cannot saturate the disk bandwidth when do page swap out even on a
> > high-end server machine.  Because the performance of the storage
> > device improved faster than that of CPU.  And it seems that the trend
> > will not change in the near future.  On the other hand, the THP
> > becomes more and more popular because of increased memory size.  So it
> > becomes necessary to optimize THP swap performance.
> > 
> > The advantages of the THP swap support include:
> > 
> > - Batch the swap operations for the THP to reduce lock
> >   acquiring/releasing, including allocating/freeing the swap space,
> >   adding/deleting to/from the swap cache, and writing/reading the swap
> >   space, etc.  This will help improve the performance of the THP swap.
> > 
> > - The THP swap space read/write will be 2M sequential IO.  It is
> >   particularly helpful for the swap read, which usually are 4k random
> >   IO.  This will improve the performance of the THP swap too.
> > 
> > - It will help the memory fragmentation, especially when the THP is
> >   heavily used by the applications.  The 2M continuous pages will be
> >   free up after THP swapping out.
> I just read patchset right now and still doubt why the all changes
> should be coupled with THP tightly. Many parts(e.g., you introduced
> or modifying existing functions for making them THP specific) could
> just take page_list and the number of pages then would handle them
> without THP awareness.
> 
> For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> can try to allocate new cluster. With that, we could allocate new
> clusters to meet nr_pages requested or bail out if we fail to allocate
> and fallback to 0-order page swapout. With that, swap layer could
> support multiple order-0 pages by batch.
> 
> IMO, I really want to land Tim Chen's batching swapout work first.
> With Tim Chen's work, I expect we can make better refactoring
> for batching swap before adding more confuse to the swap layer.
> (I expect it would share several pieces of code for or would be base
> for batching allocation of swapcache, swapslot)
Minchan,
Ying and I do plan to send out a new patch series on batching swapout
and swapin, plus a few other optimizations for the swapping of
regular sized pages.
Hopefully we'll be able to do that soon, after we fix up a few
things and retest.
Tim
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-09  5:43 ` [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Minchan Kim
  2016-09-09 15:53   ` Tim Chen
@ 2016-09-09 20:35   ` Huang, Ying
  2016-09-13  6:13     ` Minchan Kim
  1 sibling, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-09 20:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Hi, Minchan,
Minchan Kim <minchan@kernel.org> writes:
> Hi Huang,
>
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> This patchset is to optimize the performance of Transparent Huge Page
>> (THP) swap.
>> 
>> Hi, Andrew, could you help me to check whether the overall design is
>> reasonable?
>> 
>> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
>> swap part of the patchset?  Especially [01/10], [04/10], [05/10],
>> [06/10], [07/10], [10/10].
>> 
>> Hi, Andrea and Kirill, could you help me to review the THP part of the
>> patchset?  Especially [02/10], [03/10], [09/10] and [10/10].
>> 
>> Hi, Johannes, Michal and Vladimir, I am not very confident about the
>> memory cgroup part, especially [02/10] and [03/10].  Could you help me
>> to review it?
>> 
>> And for all, Any comment is welcome!
>> 
>> 
>> Recently, the performance of the storage devices improved so fast that
>> we cannot saturate the disk bandwidth when do page swap out even on a
>> high-end server machine.  Because the performance of the storage
>> device improved faster than that of CPU.  And it seems that the trend
>> will not change in the near future.  On the other hand, the THP
>> becomes more and more popular because of increased memory size.  So it
>> becomes necessary to optimize THP swap performance.
>> 
>> The advantages of the THP swap support include:
>> 
>> - Batch the swap operations for the THP to reduce lock
>>   acquiring/releasing, including allocating/freeing the swap space,
>>   adding/deleting to/from the swap cache, and writing/reading the swap
>>   space, etc.  This will help improve the performance of the THP swap.
>> 
>> - The THP swap space read/write will be 2M sequential IO.  It is
>>   particularly helpful for the swap read, which usually are 4k random
>>   IO.  This will improve the performance of the THP swap too.
>> 
>> - It will help the memory fragmentation, especially when the THP is
>>   heavily used by the applications.  The 2M continuous pages will be
>>   free up after THP swapping out.
>
> I just read patchset right now and still doubt why the all changes
> should be coupled with THP tightly. Many parts(e.g., you introduced
> or modifying existing functions for making them THP specific) could
> just take page_list and the number of pages then would handle them
> without THP awareness.
I would be glad if my changes could help normal page swapping too.  And
we can change these functions to work for normal pages when necessary.
> For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> can try to allocate new cluster. With that, we could allocate new
> clusters to meet nr_pages requested or bail out if we fail to allocate
> and fallback to 0-order page swapout. With that, swap layer could
> support multiple order-0 pages by batch.
>
> IMO, I really want to land Tim Chen's batching swapout work first.
> With Tim Chen's work, I expect we can make better refactoring
> for batching swap before adding more confuse to the swap layer.
> (I expect it would share several pieces of code for or would be base
> for batching allocation of swapcache, swapslot)
I don't think there is a hard conflict between optimizing normal page
swapping and optimizing THP swap.  Some code may be shared between
them.  That is good for both sides.
> After that, we could enhance swap for big contiguous batching
> like THP and finally we might make it be aware of THP specific to
> enhance further.
>
> A thing I remember you argued: you want to swapin 512 pages
> all at once unconditionally. It's really worth to discuss if
> your design is going for the way.
> I doubt it's generally good idea. Because, currently, we try to
> swap in swapped out pages in THP page with conservative approach
> but your direction is going to opposite way.
>
> [mm, thp: convert from optimistic swapin collapsing to conservative]
>
> I think general approach(i.e., less effective than targeting
> implement for your own specific goal but less hacky and better job
> for many cases) is to rely/improve on the swap readahead.
> If most of subpages of a THP page are really workingset, swap readahead
> could work well.
>
> Yeah, it's fairly vague feedback so sorry if I miss something clear.
Yes.  I want to go in the direction of swapping in 512 pages together.
And I think now is a good opportunity to discuss that.  The advantages
of swapping in 512 pages together are:
- Improve the performance of swap-in IO by turning small reads into
  512-page reads.
- Keep the THP across swap out/in.  As memory sizes become larger and
  larger, 4k pages put more and more burden on memory
  management.  One solution is to use 2M pages as much as possible, which
  will reduce the management burden greatly, for example through a much
  shorter LRU list.
The disadvantages are:
- Increased memory pressure when swapping in a THP.
- Some of the pages swapped in may not be needed in the near future.
Because of the disadvantages, swapping in 512 pages together should be
made optional.  But I don't think we should make it impossible.
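The trade-off can be put into rough numbers with a small user-space
sketch.  It assumes 4k base pages and 512 sub-pages per THP as on
x86_64; the TOY_ names and helper functions are hypothetical.

```c
#include <assert.h>

#define TOY_PAGE_SIZE	4096UL	/* assumed 4 KiB base page */
#define TOY_THP_PAGES	512UL	/* sub-pages per THP on x86_64 */

/* Bytes read when one whole THP is swapped in with a single read. */
static unsigned long toy_thp_swapin_bytes(void)
{
	return TOY_THP_PAGES * TOY_PAGE_SIZE;
}

/* Sub-pages read but not needed soon, if only 'needed' of them are used. */
static unsigned long toy_unneeded_pages(unsigned long needed)
{
	assert(needed <= TOY_THP_PAGES);
	return TOY_THP_PAGES - needed;
}
```

One whole-THP swap-in is a single 2M read instead of up to 512 separate
4k reads, but if the workload touches only 384 of the sub-pages, 128
pages of memory are occupied without being used.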
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-09 20:35   ` Huang, Ying
@ 2016-09-13  6:13     ` Minchan Kim
  2016-09-13  6:40       ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Minchan Kim @ 2016-09-13  6:13 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
Hi Huang,
On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
< snip >
> >> Recently, the performance of the storage devices improved so fast that
> >> we cannot saturate the disk bandwidth when do page swap out even on a
> >> high-end server machine.  Because the performance of the storage
> >> device improved faster than that of CPU.  And it seems that the trend
> >> will not change in the near future.  On the other hand, the THP
> >> becomes more and more popular because of increased memory size.  So it
> >> becomes necessary to optimize THP swap performance.
> >> 
> >> The advantages of the THP swap support include:
> >> 
> >> - Batch the swap operations for the THP to reduce lock
> >>   acquiring/releasing, including allocating/freeing the swap space,
> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >>   space, etc.  This will help improve the performance of the THP swap.
> >> 
> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >>   particularly helpful for the swap read, which usually are 4k random
> >>   IO.  This will improve the performance of the THP swap too.
> >> 
> >> - It will help the memory fragmentation, especially when the THP is
> >>   heavily used by the applications.  The 2M continuous pages will be
> >>   free up after THP swapping out.
> >
> > I just read patchset right now and still doubt why the all changes
> > should be coupled with THP tightly. Many parts(e.g., you introduced
> > or modifying existing functions for making them THP specific) could
> > just take page_list and the number of pages then would handle them
> > without THP awareness.
> 
> I am glad if my change could help normal pages swapping too.  And we can
> change these functions to work for normal pages when necessary.
Sure, but it would be less painful if THP-aware swapout were built on
top of batched swapout of multiple normal pages. For example, we leave
the THP split as-is (i.e., a THP is still split into 512 pages before
swapout) and improve swapout of multiple normal pages further, as Tim
suggested. That alone might be enough for fast storage, without
needing THP awareness.
My *point* is: let's approach this step by step.
First, go with batching normal-page swapout, and if that's not
enough, dive into further optimization such as introducing
THP-aware swapout.
I believe that's the natural development process for evolving things
without over-engineering.
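A minimal sketch of the batch-first idea, assuming a toy swap device
made of fixed-size clusters. Only SWAPFILE_CLUSTER is a name from the
discussion; NR_CLUSTERS, cluster_used and alloc_clusters_batch() are
hypothetical and exist only for illustration. The sketch reserves whole
clusters for a batch of order-0 pages and reserves nothing on failure,
so the caller can fall back to page-by-page swapout:

```c
#include <stdbool.h>
#include <stddef.h>

/* Slots per cluster; 512 matches a 2M THP made of 4k pages (assumption). */
#define SWAPFILE_CLUSTER 512
/* Size of the toy swap device in clusters (hypothetical). */
#define NR_CLUSTERS 8

static bool cluster_used[NR_CLUSTERS];

/*
 * Try to reserve enough whole free clusters to hold nr_pages slots.
 * On success, store the cluster indexes in out[] and return how many
 * were reserved.  On failure, reserve nothing and return 0 so that
 * the caller can fall back to order-0 swapout.
 */
static size_t alloc_clusters_batch(size_t nr_pages, size_t *out, size_t out_len)
{
	size_t need = (nr_pages + SWAPFILE_CLUSTER - 1) / SWAPFILE_CLUSTER;
	size_t found = 0;

	if (need == 0 || need > out_len)
		return 0;

	/* First pass: collect candidates without modifying any state. */
	for (size_t i = 0; i < NR_CLUSTERS && found < need; i++)
		if (!cluster_used[i])
			out[found++] = i;

	if (found < need)
		return 0;	/* bail out: not enough free clusters */

	/* Second pass: commit the reservation. */
	for (size_t i = 0; i < found; i++)
		cluster_used[out[i]] = true;

	return found;
}
```

The two-pass structure is what makes the bail-out cheap: a failed
request leaves the cluster map untouched, so nothing has to be undone
before falling back.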
> 
> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> > can try to allocate new cluster. With that, we could allocate new
> > clusters to meet nr_pages requested or bail out if we fail to allocate
> > and fallback to 0-order page swapout. With that, swap layer could
> > support multiple order-0 pages by batch.
> >
> > IMO, I really want to land Tim Chen's batching swapout work first.
> > With Tim Chen's work, I expect we can make better refactoring
> > for batching swap before adding more confuse to the swap layer.
> > (I expect it would share several pieces of code for or would be base
> > for batching allocation of swapcache, swapslot)
> 
> I don't think there is hard conflict between normal pages swapping
> optimizing and THP swap optimizing.  Some code may be shared between
> them.  That is good for both sides.
> 
> > After that, we could enhance swap for big contiguous batching
> > like THP and finally we might make it be aware of THP specific to
> > enhance further.
> >
> > A thing I remember you aruged: you want to swapin 512 pages
> > all at once unconditionally. It's really worth to discuss if
> > your design is going for the way.
> > I doubt it's generally good idea. Because, currently, we try to
> > swap in swapped out pages in THP page with conservative approach
> > but your direction is going to opposite way.
> >
> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >
> > I think general approach(i.e., less effective than targeting
> > implement for your own specific goal but less hacky and better job
> > for many cases) is to rely/improve on the swap readahead.
> > If most of subpages of a THP page are really workingset, swap readahead
> > could work well.
> >
> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
> 
> Yes.  I want to go to the direction that to swap in 512 pages together.
> And I think it is a good opportunity to discuss that now.  The advantages
> of swapping in 512 pages together are:
> 
> - Improve the performance of swapping in IO via turning small read size
>   into 512 pages big read size.
> 
> - Keep THP across swap out/in.  With the memory size become more and
>   more large, the 4k pages bring more and more burden to memory
>   management.  One solution is to use 2M pages as much as possible, that
>   will reduce the management burden greatly, such as much reduced length
>   of LRU list, etc.
> 
> The disadvantage are:
> 
> - Increase the memory pressure when swap in THP.
> 
> - Some pages swapped in may not needed in the near future.
> 
> Because of the disadvantages, the 512 pages swapping in should be made
> optional.  But I don't think we should make it impossible.
Yep. No need to make it impossible, but your design shouldn't be coupled
to a feature that doesn't exist yet.
> 
> Best Regards,
> Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  6:13     ` Minchan Kim
@ 2016-09-13  6:40       ` Huang, Ying
  2016-09-13  7:05         ` Minchan Kim
  0 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-13  6:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Minchan Kim <minchan@kernel.org> writes:
> Hi Huang,
>
> On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>
> < snip >
>
>> >> Recently, the performance of the storage devices improved so fast that
>> >> we cannot saturate the disk bandwidth when do page swap out even on a
>> >> high-end server machine.  Because the performance of the storage
>> >> device improved faster than that of CPU.  And it seems that the trend
>> >> will not change in the near future.  On the other hand, the THP
>> >> becomes more and more popular because of increased memory size.  So it
>> >> becomes necessary to optimize THP swap performance.
>> >> 
>> >> The advantages of the THP swap support include:
>> >> 
>> >> - Batch the swap operations for the THP to reduce lock
>> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >>   adding/deleting to/from the swap cache, and writing/reading the swap
>> >>   space, etc.  This will help improve the performance of the THP swap.
>> >> 
>> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >>   particularly helpful for the swap read, which usually are 4k random
>> >>   IO.  This will improve the performance of the THP swap too.
>> >> 
>> >> - It will help the memory fragmentation, especially when the THP is
>> >>   heavily used by the applications.  The 2M continuous pages will be
>> >>   free up after THP swapping out.
>> >
>> > I just read patchset right now and still doubt why the all changes
>> > should be coupled with THP tightly. Many parts(e.g., you introduced
>> > or modifying existing functions for making them THP specific) could
>> > just take page_list and the number of pages then would handle them
>> > without THP awareness.
>> 
>> I am glad if my change could help normal pages swapping too.  And we can
>> change these functions to work for normal pages when necessary.
>
> Sure but it would be less painful that THP awareness swapout is
> based on multiple normal pages swapout. For exmaple, we don't
> touch delay THP split part(i.e., split a THP into 512 pages like
> as-is) and enhances swapout further like Tim's suggestion
> for mulitple normal pages swapout. With that, it might be enough
> for fast-storage without needing THP awareness.
>
> My *point* is let's approach step by step.
> First of all, go with batching normal pages swapout and if it's
> not enough, dive into further optimization like introducing
> THP-aware swapout.
>
> I believe it's natural development process to evolve things
> without over-engineering.
My target is not only accelerating THP swap out, but also full THP
swap out/in support without splitting the THP.  This patchset is just
the first step toward full THP swap support.
>> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
>> > can try to allocate new cluster. With that, we could allocate new
>> > clusters to meet nr_pages requested or bail out if we fail to allocate
>> > and fallback to 0-order page swapout. With that, swap layer could
>> > support multiple order-0 pages by batch.
>> >
>> > IMO, I really want to land Tim Chen's batching swapout work first.
>> > With Tim Chen's work, I expect we can make better refactoring
>> > for batching swap before adding more confuse to the swap layer.
>> > (I expect it would share several pieces of code for or would be base
>> > for batching allocation of swapcache, swapslot)
>> 
>> I don't think there is hard conflict between normal pages swapping
>> optimizing and THP swap optimizing.  Some code may be shared between
>> them.  That is good for both sides.
>> 
>> > After that, we could enhance swap for big contiguous batching
>> > like THP and finally we might make it be aware of THP specific to
>> > enhance further.
>> >
>> > A thing I remember you aruged: you want to swapin 512 pages
>> > all at once unconditionally. It's really worth to discuss if
>> > your design is going for the way.
>> > I doubt it's generally good idea. Because, currently, we try to
>> > swap in swapped out pages in THP page with conservative approach
>> > but your direction is going to opposite way.
>> >
>> > [mm, thp: convert from optimistic swapin collapsing to conservative]
>> >
>> > I think general approach(i.e., less effective than targeting
>> > implement for your own specific goal but less hacky and better job
>> > for many cases) is to rely/improve on the swap readahead.
>> > If most of subpages of a THP page are really workingset, swap readahead
>> > could work well.
>> >
>> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
>> 
>> Yes.  I want to go to the direction that to swap in 512 pages together.
>> And I think it is a good opportunity to discuss that now.  The advantages
>> of swapping in 512 pages together are:
>> 
>> - Improve the performance of swapping in IO via turning small read size
>>   into 512 pages big read size.
>> 
>> - Keep THP across swap out/in.  With the memory size become more and
>>   more large, the 4k pages bring more and more burden to memory
>>   management.  One solution is to use 2M pages as much as possible, that
>>   will reduce the management burden greatly, such as much reduced length
>>   of LRU list, etc.
>> 
>> The disadvantage are:
>> 
>> - Increase the memory pressure when swap in THP.
>> 
>> - Some pages swapped in may not needed in the near future.
>> 
>> Because of the disadvantages, the 512 pages swapping in should be made
>> optional.  But I don't think we should make it impossible.
>
> Yeb. No need to make it impossible but your design shouldn't be coupled
> with non-existing feature yet.
Sorry, what is the "non-existing feature"?  The full THP swap out/in
support without splitting THP?  If so, this patchset is just the
first step of that.  I plan to finish the full THP swap out/in
support in 3 steps:
1. Delay splitting the THP until after it is added to the swap cache
2. Delay splitting the THP until after swap out has completed
3. Avoid splitting the THP during swap out, and swap in the full THP if
   possible
I plan to do it step by step to make the code easier to review.
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  6:40       ` Huang, Ying
@ 2016-09-13  7:05         ` Minchan Kim
  2016-09-13  8:53           ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Minchan Kim @ 2016-09-13  7:05 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Hi Huang,
> >
> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >
> > < snip >
> >
> >> >> Recently, the performance of the storage devices improved so fast that
> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
> >> >> high-end server machine.  Because the performance of the storage
> >> >> device improved faster than that of CPU.  And it seems that the trend
> >> >> will not change in the near future.  On the other hand, the THP
> >> >> becomes more and more popular because of increased memory size.  So it
> >> >> becomes necessary to optimize THP swap performance.
> >> >> 
> >> >> The advantages of the THP swap support include:
> >> >> 
> >> >> - Batch the swap operations for the THP to reduce lock
> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >> >>   space, etc.  This will help improve the performance of the THP swap.
> >> >> 
> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >>   particularly helpful for the swap read, which usually are 4k random
> >> >>   IO.  This will improve the performance of the THP swap too.
> >> >> 
> >> >> - It will help the memory fragmentation, especially when the THP is
> >> >>   heavily used by the applications.  The 2M continuous pages will be
> >> >>   free up after THP swapping out.
> >> >
> >> > I just read patchset right now and still doubt why the all changes
> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
> >> > or modifying existing functions for making them THP specific) could
> >> > just take page_list and the number of pages then would handle them
> >> > without THP awareness.
> >> 
> >> I am glad if my change could help normal pages swapping too.  And we can
> >> change these functions to work for normal pages when necessary.
> >
> > Sure but it would be less painful that THP awareness swapout is
> > based on multiple normal pages swapout. For exmaple, we don't
> > touch delay THP split part(i.e., split a THP into 512 pages like
> > as-is) and enhances swapout further like Tim's suggestion
> > for mulitple normal pages swapout. With that, it might be enough
> > for fast-storage without needing THP awareness.
> >
> > My *point* is let's approach step by step.
> > First of all, go with batching normal pages swapout and if it's
> > not enough, dive into further optimization like introducing
> > THP-aware swapout.
> >
> > I believe it's natural development process to evolve things
> > without over-engineering.
> 
> My target is not only the THP swap out acceleration, but also the full
> THP swap out/in support without splitting THP.  This patchset is just
> the first step of the full THP swap support.
> 
> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> >> > can try to allocate new cluster. With that, we could allocate new
> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
> >> > and fallback to 0-order page swapout. With that, swap layer could
> >> > support multiple order-0 pages by batch.
> >> >
> >> > IMO, I really want to land Tim Chen's batching swapout work first.
> >> > With Tim Chen's work, I expect we can make better refactoring
> >> > for batching swap before adding more confuse to the swap layer.
> >> > (I expect it would share several pieces of code for or would be base
> >> > for batching allocation of swapcache, swapslot)
> >> 
> >> I don't think there is hard conflict between normal pages swapping
> >> optimizing and THP swap optimizing.  Some code may be shared between
> >> them.  That is good for both sides.
> >> 
> >> > After that, we could enhance swap for big contiguous batching
> >> > like THP and finally we might make it be aware of THP specific to
> >> > enhance further.
> >> >
> >> > A thing I remember you aruged: you want to swapin 512 pages
> >> > all at once unconditionally. It's really worth to discuss if
> >> > your design is going for the way.
> >> > I doubt it's generally good idea. Because, currently, we try to
> >> > swap in swapped out pages in THP page with conservative approach
> >> > but your direction is going to opposite way.
> >> >
> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >> >
> >> > I think general approach(i.e., less effective than targeting
> >> > implement for your own specific goal but less hacky and better job
> >> > for many cases) is to rely/improve on the swap readahead.
> >> > If most of subpages of a THP page are really workingset, swap readahead
> >> > could work well.
> >> >
> >> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
> >> 
> >> Yes.  I want to go to the direction that to swap in 512 pages together.
> >> And I think it is a good opportunity to discuss that now.  The advantages
> >> of swapping in 512 pages together are:
> >> 
> >> - Improve the performance of swapping in IO via turning small read size
> >>   into 512 pages big read size.
> >> 
> >> - Keep THP across swap out/in.  With the memory size become more and
> >>   more large, the 4k pages bring more and more burden to memory
> >>   management.  One solution is to use 2M pages as much as possible, that
> >>   will reduce the management burden greatly, such as much reduced length
> >>   of LRU list, etc.
> >> 
> >> The disadvantage are:
> >> 
> >> - Increase the memory pressure when swap in THP.
> >> 
> >> - Some pages swapped in may not needed in the near future.
> >> 
> >> Because of the disadvantages, the 512 pages swapping in should be made
> >> optional.  But I don't think we should make it impossible.
> >
> > Yeb. No need to make it impossible but your design shouldn't be coupled
> > with non-existing feature yet.
> 
> Sorry, what is the "non-existing feature"?  The full THP swap out/in
THP swapin.
You said you increased the cluster size to fit the THP size so that
you can record some metadata there for THP swapin.
You gave numbers showing how badly the current swapout scales, so you
are trying to enhance that path. I agree with that, although I don't
like your approach as a first step.
However, you didn't give any clue as to why we should swap in a whole
THP. How bad is the current conservative swapin from khugepaged
really, and why can't we enhance that?
> support without splitting THP?  If so, this patchset is the just the
> first step of that.  I plan to finish the the full THP swap out/in
> support in 3 steps:
> 
> 1. Delay splitting the THP after adding it into swap cache
> 
> 2. Delay splitting the THP after swapping out being completed
> 
> 3. Avoid splitting the THP during swap out, and swap in the full THP if
>    possible
> 
> I plan to do it step by step to make it easier to review the code.
1. If we solve batching swapout, then how is splitting the THP for swapout bad?
2. Also, how is the current conservative swapin from khugepaged bad?
I think this is one of the decision points for the motivation of your
work, and for 1, we need the batching swapout feature.
I am saying again that I'm not against your goal; my only concern
is the approach. If you don't agree, please ignore me.
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  7:05         ` Minchan Kim
@ 2016-09-13  8:53           ` Huang, Ying
  2016-09-13  9:16             ` Minchan Kim
  2016-09-13 14:35             ` Andrea Arcangeli
  0 siblings, 2 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-13  8:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > Hi Huang,
>> >
>> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >
>> > < snip >
>> >
>> >> >> Recently, the performance of the storage devices improved so fast that
>> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
>> >> >> high-end server machine.  Because the performance of the storage
>> >> >> device improved faster than that of CPU.  And it seems that the trend
>> >> >> will not change in the near future.  On the other hand, the THP
>> >> >> becomes more and more popular because of increased memory size.  So it
>> >> >> becomes necessary to optimize THP swap performance.
>> >> >> 
>> >> >> The advantages of the THP swap support include:
>> >> >> 
>> >> >> - Batch the swap operations for the THP to reduce lock
>> >> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
>> >> >>   space, etc.  This will help improve the performance of the THP swap.
>> >> >> 
>> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >> >>   particularly helpful for the swap read, which usually are 4k random
>> >> >>   IO.  This will improve the performance of the THP swap too.
>> >> >> 
>> >> >> - It will help the memory fragmentation, especially when the THP is
>> >> >>   heavily used by the applications.  The 2M continuous pages will be
>> >> >>   free up after THP swapping out.
>> >> >
>> >> > I just read patchset right now and still doubt why the all changes
>> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
>> >> > or modifying existing functions for making them THP specific) could
>> >> > just take page_list and the number of pages then would handle them
>> >> > without THP awareness.
>> >> 
>> >> I am glad if my change could help normal pages swapping too.  And we can
>> >> change these functions to work for normal pages when necessary.
>> >
>> > Sure but it would be less painful that THP awareness swapout is
>> > based on multiple normal pages swapout. For exmaple, we don't
>> > touch delay THP split part(i.e., split a THP into 512 pages like
>> > as-is) and enhances swapout further like Tim's suggestion
>> > for mulitple normal pages swapout. With that, it might be enough
>> > for fast-storage without needing THP awareness.
>> >
>> > My *point* is let's approach step by step.
>> > First of all, go with batching normal pages swapout and if it's
>> > not enough, dive into further optimization like introducing
>> > THP-aware swapout.
>> >
>> > I believe it's natural development process to evolve things
>> > without over-engineering.
>> 
>> My target is not only the THP swap out acceleration, but also the full
>> THP swap out/in support without splitting THP.  This patchset is just
>> the first step of the full THP swap support.
>> 
>> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
>> >> > can try to allocate new cluster. With that, we could allocate new
>> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
>> >> > and fallback to 0-order page swapout. With that, swap layer could
>> >> > support multiple order-0 pages by batch.
>> >> >
>> >> > IMO, I really want to land Tim Chen's batching swapout work first.
>> >> > With Tim Chen's work, I expect we can make better refactoring
>> >> > for batching swap before adding more confuse to the swap layer.
>> >> > (I expect it would share several pieces of code for or would be base
>> >> > for batching allocation of swapcache, swapslot)
>> >> 
>> >> I don't think there is hard conflict between normal pages swapping
>> >> optimizing and THP swap optimizing.  Some code may be shared between
>> >> them.  That is good for both sides.
>> >> 
>> >> > After that, we could enhance swap for big contiguous batching
>> >> > like THP and finally we might make it be aware of THP specific to
>> >> > enhance further.
>> >> >
>> >> > A thing I remember you aruged: you want to swapin 512 pages
>> >> > all at once unconditionally. It's really worth to discuss if
>> >> > your design is going for the way.
>> >> > I doubt it's generally good idea. Because, currently, we try to
>> >> > swap in swapped out pages in THP page with conservative approach
>> >> > but your direction is going to opposite way.
>> >> >
>> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
>> >> >
>> >> > I think general approach(i.e., less effective than targeting
>> >> > implement for your own specific goal but less hacky and better job
>> >> > for many cases) is to rely/improve on the swap readahead.
>> >> > If most of subpages of a THP page are really workingset, swap readahead
>> >> > could work well.
>> >> >
>> >> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
>> >> 
>> >> Yes.  I want to go to the direction that to swap in 512 pages together.
>> >> And I think it is a good opportunity to discuss that now.  The advantages
>> >> of swapping in 512 pages together are:
>> >> 
>> >> - Improve the performance of swapping in IO via turning small read size
>> >>   into 512 pages big read size.
>> >> 
>> >> - Keep THP across swap out/in.  With the memory size become more and
>> >>   more large, the 4k pages bring more and more burden to memory
>> >>   management.  One solution is to use 2M pages as much as possible, that
>> >>   will reduce the management burden greatly, such as much reduced length
>> >>   of LRU list, etc.
>> >> 
>> >> The disadvantage are:
>> >> 
>> >> - Increase the memory pressure when swap in THP.
>> >> 
>> >> - Some pages swapped in may not needed in the near future.
>> >> 
>> >> Because of the disadvantages, the 512 pages swapping in should be made
>> >> optional.  But I don't think we should make it impossible.
>> >
>> > Yeb. No need to make it impossible but your design shouldn't be coupled
>> > with non-existing feature yet.
>> 
>> Sorry, what is the "non-existing feature"?  The full THP swap out/in
>
> THP swapin.
>
> You said you increased cluster size to fit a THP size for recording
> some meta in there for THP swapin.
And to find the head of the THP, so that the whole THP can be swapped
in when an address in the middle of it is accessed.
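A minimal sketch of that lookup, assuming the swap cluster size equals
the THP size (512 slots for a 2M THP with 4k pages) and that a
swapped-out THP occupies exactly one cluster-aligned run of slots.
THP_NR_PAGES, thp_head_offset() and thp_subpage_index() are
hypothetical names for illustration, not code from the patchset:

```c
/* Assumption: 2M THP / 4k base page = 512 swap slots per cluster. */
#define THP_NR_PAGES 512UL

/*
 * Round a swap offset down to the first slot of its cluster.  Under
 * the alignment assumption above, this is the slot of the THP head.
 */
static inline unsigned long thp_head_offset(unsigned long offset)
{
	return offset & ~(THP_NR_PAGES - 1);
}

/* Index of the faulting subpage within the THP (0..511). */
static inline unsigned long thp_subpage_index(unsigned long offset)
{
	return offset & (THP_NR_PAGES - 1);
}
```

With cluster-aligned placement, a fault on any subpage finds the head
with a single mask and no extra per-slot metadata lookup.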
> You gave number about how scale bad current swapout so try to enhance
> that path. I agree it alghouth I don't like your approach for first step.
> However, you didn't give any clue why we should swap in a THP. How bad
> current conservative swapin from khugepagd is really bad and why cannot
> enhance that.
>
>> support without splitting THP?  If so, this patchset is the just the
>> first step of that.  I plan to finish the the full THP swap out/in
>> support in 3 steps:
>> 
>> 1. Delay splitting the THP after adding it into swap cache
>> 
>> 2. Delay splitting the THP after swapping out being completed
>> 
>> 3. Avoid splitting the THP during swap out, and swap in the full THP if
>>    possible
>> 
>> I plan to do it step by step to make it easier to review the code.
>
> 1. If we solve batching swapout, then how is THP split for swapout bad?
> 2. Also, how is current conservatie swapin from khugepaged bad?
>
> I think it's one of decision point for the motivation of your work
> and for 1, we need batching swapout feature.
>
> I am saying again that I'm not against your goal but only concern
> is approach. If you don't agree, please ignore me.
I am glad to discuss my final goal, that is, swapping out/in the full
THP without splitting.  The reasons I want to do that are copied below:
>> >> The advantages of swapping in 512 pages together are:
>> >> 
>> >> - Improve the performance of swapping in IO via turning small read size
>> >>   into 512 pages big read size.
>> >> 
>> >> - Keep THP across swap out/in.  With the memory size become more and
>> >>   more large, the 4k pages bring more and more burden to memory
>> >>   management.  One solution is to use 2M pages as much as possible, that
>> >>   will reduce the management burden greatly, such as much reduced length
>> >>   of LRU list, etc.
- Avoid spending CPU time on splitting and collapsing the THP across
  swap out/in.
>> >> 
>> >> The disadvantage are:
>> >> 
>> >> - Increase the memory pressure when swap in THP.
>> >> 
>> >> - Some pages swapped in may not needed in the near future.
I think it is important to use 2M pages as much as possible to deal with
the big memory problem.  Do you agree?
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  8:53           ` Huang, Ying
@ 2016-09-13  9:16             ` Minchan Kim
  2016-09-13 23:52               ` Chen, Tim C
  2016-09-18  1:53               ` Huang, Ying
  2016-09-13 14:35             ` Andrea Arcangeli
  1 sibling, 2 replies; 60+ messages in thread
From: Minchan Kim @ 2016-09-13  9:16 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > Hi Huang,
> >> >
> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >> >
> >> > < snip >
> >> >
> >> >> >> Recently, the performance of the storage devices improved so fast that
> >> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
> >> >> >> high-end server machine.  Because the performance of the storage
> >> >> >> device improved faster than that of CPU.  And it seems that the trend
> >> >> >> will not change in the near future.  On the other hand, the THP
> >> >> >> becomes more and more popular because of increased memory size.  So it
> >> >> >> becomes necessary to optimize THP swap performance.
> >> >> >> 
> >> >> >> The advantages of the THP swap support include:
> >> >> >> 
> >> >> >> - Batch the swap operations for the THP to reduce lock
> >> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >> >> >>   space, etc.  This will help improve the performance of the THP swap.
> >> >> >> 
> >> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >> >>   particularly helpful for the swap read, which usually are 4k random
> >> >> >>   IO.  This will improve the performance of the THP swap too.
> >> >> >> 
> >> >> >> - It will help the memory fragmentation, especially when the THP is
> >> >> >>   heavily used by the applications.  The 2M continuous pages will be
> >> >> >>   free up after THP swapping out.
> >> >> >
> >> >> > I just read patchset right now and still doubt why the all changes
> >> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
> >> >> > or modifying existing functions for making them THP specific) could
> >> >> > just take page_list and the number of pages then would handle them
> >> >> > without THP awareness.
> >> >> 
> >> >> I am glad if my change could help normal pages swapping too.  And we can
> >> >> change these functions to work for normal pages when necessary.
> >> >
> >> > Sure but it would be less painful that THP awareness swapout is
> >> > based on multiple normal pages swapout. For exmaple, we don't
> >> > touch delay THP split part(i.e., split a THP into 512 pages like
> >> > as-is) and enhances swapout further like Tim's suggestion
> >> > for mulitple normal pages swapout. With that, it might be enough
> >> > for fast-storage without needing THP awareness.
> >> >
> >> > My *point* is let's approach step by step.
> >> > First of all, go with batching normal pages swapout and if it's
> >> > not enough, dive into further optimization like introducing
> >> > THP-aware swapout.
> >> >
> >> > I believe it's natural development process to evolve things
> >> > without over-engineering.
> >> 
> >> My target is not only the THP swap out acceleration, but also the full
> >> THP swap out/in support without splitting THP.  This patchset is just
> >> the first step of the full THP swap support.
> >> 
> >> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> >> >> > can try to allocate new cluster. With that, we could allocate new
> >> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
> >> >> > and fallback to 0-order page swapout. With that, swap layer could
> >> >> > support multiple order-0 pages by batch.
> >> >> >
> >> >> > IMO, I really want to land Tim Chen's batching swapout work first.
> >> >> > With Tim Chen's work, I expect we can make better refactoring
> >> >> > for batching swap before adding more confuse to the swap layer.
> >> >> > (I expect it would share several pieces of code for or would be base
> >> >> > for batching allocation of swapcache, swapslot)
> >> >> 
> >> >> I don't think there is hard conflict between normal pages swapping
> >> >> optimizing and THP swap optimizing.  Some code may be shared between
> >> >> them.  That is good for both sides.
> >> >> 
> >> >> > After that, we could enhance swap for big contiguous batching
> >> >> > like THP and finally we might make it be aware of THP specific to
> >> >> > enhance further.
> >> >> >
> >> >> > A thing I remember you argued: you want to swapin 512 pages
> >> >> > all at once unconditionally. It's really worth to discuss if
> >> >> > your design is going for the way.
> >> >> > I doubt it's generally good idea. Because, currently, we try to
> >> >> > swap in swapped out pages in THP page with conservative approach
> >> >> > but your direction is going to opposite way.
> >> >> >
> >> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >> >> >
> >> >> > I think general approach(i.e., less effective than targeting
> >> >> > implement for your own specific goal but less hacky and better job
> >> >> > for many cases) is to rely/improve on the swap readahead.
> >> >> > If most of subpages of a THP page are really workingset, swap readahead
> >> >> > could work well.
> >> >> >
> >> >> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
> >> >> 
> >> >> Yes.  I want to go to the direction that to swap in 512 pages together.
> >> >> And I think it is a good opportunity to discuss that now.  The advantages
> >> >> of swapping in 512 pages together are:
> >> >> 
> >> >> - Improve the performance of swapping in IO via turning small read size
> >> >>   into 512 pages big read size.
> >> >> 
> >> >> - Keep THP across swap out/in.  With the memory size become more and
> >> >>   more large, the 4k pages bring more and more burden to memory
> >> >>   management.  One solution is to use 2M pages as much as possible, that
> >> >>   will reduce the management burden greatly, such as much reduced length
> >> >>   of LRU list, etc.
> >> >> 
> >> >> The disadvantage are:
> >> >> 
> >> >> - Increase the memory pressure when swap in THP.
> >> >> 
> >> >> - Some pages swapped in may not needed in the near future.
> >> >> 
> >> >> Because of the disadvantages, the 512 pages swapping in should be made
> >> >> optional.  But I don't think we should make it impossible.
> >> >
> >> > Yeb. No need to make it impossible but your design shouldn't be coupled
> >> > with non-existing feature yet.
> >> 
> >> Sorry, what is the "non-existing feature"?  The full THP swap out/in
> >
> > THP swapin.
> >
> > You said you increased cluster size to fit a THP size for recording
> > some meta in there for THP swapin.
> 
> And to find the head of the THP to swap in the whole THP when an address
> in the middle of a THP is accessed.
> 
> > You gave number about how scale bad current swapout so try to enhance
> > that path. I agree it although I don't like your approach for first step.
> > However, you didn't give any clue why we should swap in a THP. How bad
> > current conservative swapin from khugepaged is really bad and why cannot
> > enhance that.
> >
> >> support without splitting THP?  If so, this patchset is just the
> >> first step of that.  I plan to finish the full THP swap out/in
> >> support in 3 steps:
> >> 
> >> 1. Delay splitting the THP after adding it into swap cache
> >> 
> >> 2. Delay splitting the THP after swapping out being completed
> >> 
> >> 3. Avoid splitting the THP during swap out, and swap in the full THP if
> >>    possible
> >> 
> >> I plan to do it step by step to make it easier to review the code.
> >
> > 1. If we solve batching swapout, then how is THP split for swapout bad?
> > 2. Also, how is current conservative swapin from khugepaged bad?
> >
> > I think it's one of decision point for the motivation of your work
> > and for 1, we need batching swapout feature.
> >
> > I am saying again that I'm not against your goal but only concern
> > is approach. If you don't agree, please ignore me.
> 
> I am glad to discuss my final goal, that is, swapping out/in the full
> THP without splitting.  Why I want to do that is copied as below,
Yes, it's your *final* goal, but what if it turns out not to be
acceptable at, for example, the second step you mentioned above?
        An implementation unnecessarily bound to rejected work.
If you want to achieve your goal step by step, please consider that any
one of the steps you are planning could be rejected, so the steps already
merged should be self-contained, without side effects.
If that's hard, send the full patchset all at once so reviewers can judge
whether what you want is the right direction and whether the
implementation is good for it.
> 
> >> >> The advantages of swapping in 512 pages together are:
> >> >> 
> >> >> - Improve the performance of swapping in IO via turning small read size
> >> >>   into 512 pages big read size.
> >> >> 
> >> >> - Keep THP across swap out/in.  With the memory size become more and
> >> >>   more large, the 4k pages bring more and more burden to memory
> >> >>   management.  One solution is to use 2M pages as much as possible, that
> >> >>   will reduce the management burden greatly, such as much reduced length
> >> >>   of LRU list, etc.
> 
> - Avoid CPU time for splitting, collapsing THP across swap out/in.
Yes; if you want that, please show us how bad it is.
> 
> >> >> 
> >> >> The disadvantage are:
> >> >> 
> >> >> - Increase the memory pressure when swap in THP.
> >> >> 
> >> >> - Some pages swapped in may not needed in the near future.
> 
> I think it is important to use 2M pages as much as possible to deal with
> the big memory problem.  Do you agree?
I have seen no numbers showing what the current problems are or
how popular it is these days, so I don't agree.
> 
> Best Regards,
> Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  8:53           ` Huang, Ying
  2016-09-13  9:16             ` Minchan Kim
@ 2016-09-13 14:35             ` Andrea Arcangeli
  1 sibling, 0 replies; 60+ messages in thread
From: Andrea Arcangeli @ 2016-09-13 14:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
Hello,
On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> I am glad to discuss my final goal, that is, swapping out/in the full
> THP without splitting.  Why I want to do that is copied as below,
I think that is a fine objective. It wasn't implemented initially just
to keep things simple.
Doing it will reduce swap fragmentation (provided we can find a
physically contiguous piece of swap to swap out the THP in the first
place), it will make all the other heuristics that try to keep the swap
space contiguous less relevant, and it should increase the swap
bandwidth significantly, at least on spindle disks. I personally see it
as a positive that we rely less on those and on the readahead swapin.
> >> >> The disadvantage are:
> >> >> 
> >> >> - Increase the memory pressure when swap in THP.
That is always true with THP enabled set to always. It is the tradeoff.
It still cannot use more RAM than userland ever allocated in the vma as
virtual memory. If userland doesn't ever need such memory it can free it
by zapping the vma and the THP will be split. If the vma is zapped
while the THP is natively swapped out, the zapped portion of swap
space shall be released as well. So ultimately userland always
controls the cap on the max virtual memory (ram+swap) the kernel
decides to use with THP enabled set to always.
> I think it is important to use 2M pages as much as possible to deal with
> the big memory problem.  Do you agree?
I agree.
Thanks,
Andrea
* RE: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  9:16             ` Minchan Kim
@ 2016-09-13 23:52               ` Chen, Tim C
  2016-09-19  7:11                 ` Minchan Kim
  2016-09-18  1:53               ` Huang, Ying
  1 sibling, 1 reply; 60+ messages in thread
From: Chen, Tim C @ 2016-09-13 23:52 UTC (permalink / raw)
  To: Minchan Kim, Huang, Ying
  Cc: Andrew Morton, Hansen, Dave, Kleen, Andi, Lu, Aaron,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins,
	Shaohua Li, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
>>
>> - Avoid CPU time for splitting, collapsing THP across swap out/in.
>
>Yes, if you want, please give us how bad it is.
>
It could be pretty bad.  In an experiment where THP is turned on and we
enter swap, 50% of the CPU time is spent in the page compaction path.
So if we could deal with large pages as the unit for swap, the overhead
of splitting and of compacting ordinary pages into large pages could be
avoided.
   51.89%    51.89%            :1688  [kernel.kallsyms]   [k] pageblock_pfn_to_page                       
                      |
                      --- pageblock_pfn_to_page
                         |          
                         |--64.57%-- compaction_alloc
                         |          migrate_pages
                         |          compact_zone
                         |          compact_zone_order
                         |          try_to_compact_pages
                         |          __alloc_pages_direct_compact
                         |          __alloc_pages_nodemask
                         |          alloc_pages_vma
                         |          do_huge_pmd_anonymous_page
                         |          handle_mm_fault
                         |          __do_page_fault
                         |          do_page_fault
                         |          page_fault
                         |          0x401d9a
                         |          
                         |--34.62%-- compact_zone
                         |          compact_zone_order
                         |          try_to_compact_pages
                         |          __alloc_pages_direct_compact
                         |          __alloc_pages_nodemask
                         |          alloc_pages_vma
                         |          do_huge_pmd_anonymous_page
                         |          handle_mm_fault
                         |          __do_page_fault
                         |          do_page_fault
                         |          page_fault
                         |          0x401d9a
                          --0.81%-- [...]
Tim
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13  9:16             ` Minchan Kim
  2016-09-13 23:52               ` Chen, Tim C
@ 2016-09-18  1:53               ` Huang, Ying
  2016-09-19  7:08                 ` Minchan Kim
  1 sibling, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-18  1:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Minchan Kim <minchan@kernel.org> writes:
> On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > Hi Huang,
>> >> >
>> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >> >
>> >> > < snip >
>> >> >
>> >> >> >> Recently, the performance of the storage devices improved so fast that
>> >> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
>> >> >> >> high-end server machine.  Because the performance of the storage
>> >> >> >> device improved faster than that of CPU.  And it seems that the trend
>> >> >> >> will not change in the near future.  On the other hand, the THP
>> >> >> >> becomes more and more popular because of increased memory size.  So it
>> >> >> >> becomes necessary to optimize THP swap performance.
>> >> >> >> 
>> >> >> >> The advantages of the THP swap support include:
>> >> >> >> 
>> >> >> >> - Batch the swap operations for the THP to reduce lock
>> >> >> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
>> >> >> >>   space, etc.  This will help improve the performance of the THP swap.
>> >> >> >> 
>> >> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >> >> >>   particularly helpful for the swap read, which usually are 4k random
>> >> >> >>   IO.  This will improve the performance of the THP swap too.
>> >> >> >> 
>> >> >> >> - It will help the memory fragmentation, especially when the THP is
>> >> >> >>   heavily used by the applications.  The 2M continuous pages will be
>> >> >> >>   free up after THP swapping out.
>> >> >> >
>> >> >> > I just read patchset right now and still doubt why the all changes
>> >> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
>> >> >> > or modifying existing functions for making them THP specific) could
>> >> >> > just take page_list and the number of pages then would handle them
>> >> >> > without THP awareness.
>> >> >> 
>> >> >> I am glad if my change could help normal pages swapping too.  And we can
>> >> >> change these functions to work for normal pages when necessary.
>> >> >
>> >> > Sure but it would be less painful that THP awareness swapout is
>> >> > based on multiple normal pages swapout. For example, we don't
>> >> > touch delay THP split part(i.e., split a THP into 512 pages like
>> >> > as-is) and enhances swapout further like Tim's suggestion
>> >> > for multiple normal pages swapout. With that, it might be enough
>> >> > for fast-storage without needing THP awareness.
>> >> >
>> >> > My *point* is let's approach step by step.
>> >> > First of all, go with batching normal pages swapout and if it's
>> >> > not enough, dive into further optimization like introducing
>> >> > THP-aware swapout.
>> >> >
>> >> > I believe it's natural development process to evolve things
>> >> > without over-engineering.
>> >> 
>> >> My target is not only the THP swap out acceleration, but also the full
>> >> THP swap out/in support without splitting THP.  This patchset is just
>> >> the first step of the full THP swap support.
>> >> 
>> >> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
>> >> >> > can try to allocate new cluster. With that, we could allocate new
>> >> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
>> >> >> > and fallback to 0-order page swapout. With that, swap layer could
>> >> >> > support multiple order-0 pages by batch.
>> >> >> >
>> >> >> > IMO, I really want to land Tim Chen's batching swapout work first.
>> >> >> > With Tim Chen's work, I expect we can make better refactoring
>> >> >> > for batching swap before adding more confuse to the swap layer.
>> >> >> > (I expect it would share several pieces of code for or would be base
>> >> >> > for batching allocation of swapcache, swapslot)
>> >> >> 
>> >> >> I don't think there is hard conflict between normal pages swapping
>> >> >> optimizing and THP swap optimizing.  Some code may be shared between
>> >> >> them.  That is good for both sides.
>> >> >> 
>> >> >> > After that, we could enhance swap for big contiguous batching
>> >> >> > like THP and finally we might make it be aware of THP specific to
>> >> >> > enhance further.
>> >> >> >
>> >> >> > A thing I remember you argued: you want to swapin 512 pages
>> >> >> > all at once unconditionally. It's really worth to discuss if
>> >> >> > your design is going for the way.
>> >> >> > I doubt it's generally good idea. Because, currently, we try to
>> >> >> > swap in swapped out pages in THP page with conservative approach
>> >> >> > but your direction is going to opposite way.
>> >> >> >
>> >> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
>> >> >> >
>> >> >> > I think general approach(i.e., less effective than targeting
>> >> >> > implement for your own specific goal but less hacky and better job
>> >> >> > for many cases) is to rely/improve on the swap readahead.
>> >> >> > If most of subpages of a THP page are really workingset, swap readahead
>> >> >> > could work well.
>> >> >> >
>> >> >> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
>> >> >> 
>> >> >> Yes.  I want to go to the direction that to swap in 512 pages together.
>> >> >> And I think it is a good opportunity to discuss that now.  The advantages
>> >> >> of swapping in 512 pages together are:
>> >> >> 
>> >> >> - Improve the performance of swapping in IO via turning small read size
>> >> >>   into 512 pages big read size.
>> >> >> 
>> >> >> - Keep THP across swap out/in.  With the memory size become more and
>> >> >>   more large, the 4k pages bring more and more burden to memory
>> >> >>   management.  One solution is to use 2M pages as much as possible, that
>> >> >>   will reduce the management burden greatly, such as much reduced length
>> >> >>   of LRU list, etc.
>> >> >> 
>> >> >> The disadvantage are:
>> >> >> 
>> >> >> - Increase the memory pressure when swap in THP.
>> >> >> 
>> >> >> - Some pages swapped in may not needed in the near future.
>> >> >> 
>> >> >> Because of the disadvantages, the 512 pages swapping in should be made
>> >> >> optional.  But I don't think we should make it impossible.
>> >> >
>> >> > Yeb. No need to make it impossible but your design shouldn't be coupled
>> >> > with non-existing feature yet.
>> >> 
>> >> Sorry, what is the "non-existing feature"?  The full THP swap out/in
>> >
>> > THP swapin.
>> >
>> > You said you increased cluster size to fit a THP size for recording
>> > some meta in there for THP swapin.
>> 
>> And to find the head of the THP to swap in the whole THP when an address
>> in the middle of a THP is accessed.
>> 
>> > You gave number about how scale bad current swapout so try to enhance
>> > that path. I agree it although I don't like your approach for first step.
>> > However, you didn't give any clue why we should swap in a THP. How bad
>> > current conservative swapin from khugepaged is really bad and why cannot
>> > enhance that.
>> >
>> >> support without splitting THP?  If so, this patchset is just the
>> >> first step of that.  I plan to finish the full THP swap out/in
>> >> support in 3 steps:
>> >> 
>> >> 1. Delay splitting the THP after adding it into swap cache
>> >> 
>> >> 2. Delay splitting the THP after swapping out being completed
>> >> 
>> >> 3. Avoid splitting the THP during swap out, and swap in the full THP if
>> >>    possible
>> >> 
>> >> I plan to do it step by step to make it easier to review the code.
>> >
>> > 1. If we solve batching swapout, then how is THP split for swapout bad?
>> > 2. Also, how is current conservative swapin from khugepaged bad?
>> >
>> > I think it's one of decision point for the motivation of your work
>> > and for 1, we need batching swapout feature.
>> >
>> > I am saying again that I'm not against your goal but only concern
>> > is approach. If you don't agree, please ignore me.
>> 
>> I am glad to discuss my final goal, that is, swapping out/in the full
>> THP without splitting.  Why I want to do that is copied as below,
>
> Yes, it's your *final* goal but what if it couldn't be acceptable
> on second step you mentioned above, for example?
>
>         An implementation unnecessarily bound to rejected work.
So I want to discuss my final goal.  If people accept my final goal,
this is resolved.  If people don't accept it, I will reconsider it.
> If you want to achieve your goal step by step, please consider if
> one of step you are thinking could be rejected but steps already
> merged should be self-contained without side-effect.
What are the side effects or possible regressions of step 1 as done in
this patchset?  That it lacks the opportunity to allocate 512
consecutive swap slots across 2 non-free swap clusters?  I don't think
that is a regression, because the patchset will NOT make free swap
clusters be consumed faster than in the current code.  Even if it were
better to allocate 512 consecutive swap slots across 2 non-free swap
clusters, that could be an incremental improvement to the simple
solution in this patchset.  That is, to allocate 512 swap slots, the
simple solution is:
a) Try to allocate a free swap cluster
b) If a) fails, give up
The improved solution could be (if it turns out to be needed):
a) Try to allocate a free swap cluster
b) If a) fails, try to allocate 512 consecutive swap slots across 2
   non-free swap clusters
c) If b) fails, give up
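To make the difference between the two allocation strategies listed
above concrete, here is a minimal sketch in C.  It is a toy model only,
not the code in this patchset: the flat slot_used[] array, the 3-cluster
swap area, and the names swap_map_reset(), alloc_thp_simple() and
alloc_thp_improved() are all assumptions made for illustration.  The
real swap allocator in mm/swapfile.c tracks clusters quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed sizes for the toy model: 512 4K slots per 2M cluster. */
#define SLOTS_PER_CLUSTER ((size_t)512)
#define NR_CLUSTERS       ((size_t)3)
#define NR_SLOTS          (SLOTS_PER_CLUSTER * NR_CLUSTERS)

/* Hypothetical swap map: true means the slot is in use. */
static bool slot_used[NR_SLOTS];

static void swap_map_reset(void)
{
	memset(slot_used, 0, sizeof(slot_used));
}

/* Return true if every slot in [start, start + n) is free. */
static bool range_free(size_t start, size_t n)
{
	assert(start + n <= NR_SLOTS);
	for (size_t i = 0; i < n; i++)
		if (slot_used[start + i])
			return false;
	return true;
}

static void mark_used(size_t start, size_t n)
{
	assert(start + n <= NR_SLOTS);
	for (size_t i = 0; i < n; i++)
		slot_used[start + i] = true;
}

/*
 * Simple solution: a) take a wholly free cluster, b) otherwise give up.
 * Returns the first slot of the allocation, or -1 on failure.
 */
static long alloc_thp_simple(void)
{
	for (size_t c = 0; c < NR_CLUSTERS; c++) {
		size_t start = c * SLOTS_PER_CLUSTER;

		if (range_free(start, SLOTS_PER_CLUSTER)) {
			mark_used(start, SLOTS_PER_CLUSTER);
			return (long)start;
		}
	}
	return -1;
}

/*
 * Improved solution: a) as above; b) otherwise look for 512 consecutive
 * free slots at any offset, which at this point can only be a run that
 * straddles two partly used clusters; c) otherwise give up.
 */
static long alloc_thp_improved(void)
{
	long slot = alloc_thp_simple();

	if (slot >= 0)
		return slot;
	for (size_t start = 0; start + SLOTS_PER_CLUSTER <= NR_SLOTS; start++) {
		if (range_free(start, SLOTS_PER_CLUSTER)) {
			mark_used(start, SLOTS_PER_CLUSTER);
			return (long)start;
		}
	}
	return -1;
}
```

In this model, if slots 0-255, 768-1023 and 1024 are marked used, no
cluster is wholly free, so alloc_thp_simple() returns -1, while
alloc_thp_improved() finds the free run starting at slot 256 that
straddles clusters 0 and 1.  The linear scan in step b) is only there to
keep the sketch short; it says nothing about how such a search would
have to be done in the kernel.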
> If it's hard, send full patchset all at once so reviewers can think
> what you want of right direction and implementation is good for it.
Thanks for the suggestion.
[snip]
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-18  1:53               ` Huang, Ying
@ 2016-09-19  7:08                 ` Minchan Kim
  2016-09-20  2:54                   ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Minchan Kim @ 2016-09-19  7:08 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
Hi Huang,
On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim <minchan@kernel.org> writes:
> >> >> 
> >> >> > Hi Huang,
> >> >> >
> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >> >> >
> >> >> > < snip >
> >> >> >
> >> >> >> >> Recently, the performance of the storage devices improved so fast that
> >> >> >> >> we cannot saturate the disk bandwidth when do page swap out even on a
> >> >> >> >> high-end server machine.  Because the performance of the storage
> >> >> >> >> device improved faster than that of CPU.  And it seems that the trend
> >> >> >> >> will not change in the near future.  On the other hand, the THP
> >> >> >> >> becomes more and more popular because of increased memory size.  So it
> >> >> >> >> becomes necessary to optimize THP swap performance.
> >> >> >> >> 
> >> >> >> >> The advantages of the THP swap support include:
> >> >> >> >> 
> >> >> >> >> - Batch the swap operations for the THP to reduce lock
> >> >> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >> >> >> >>   space, etc.  This will help improve the performance of the THP swap.
> >> >> >> >> 
> >> >> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >> >> >>   particularly helpful for the swap read, which usually are 4k random
> >> >> >> >>   IO.  This will improve the performance of the THP swap too.
> >> >> >> >> 
> >> >> >> >> - It will help the memory fragmentation, especially when the THP is
> >> >> >> >>   heavily used by the applications.  The 2M continuous pages will be
> >> >> >> >>   free up after THP swapping out.
> >> >> >> >
> >> >> >> > I just read patchset right now and still doubt why the all changes
> >> >> >> > should be coupled with THP tightly. Many parts(e.g., you introduced
> >> >> >> > or modifying existing functions for making them THP specific) could
> >> >> >> > just take page_list and the number of pages then would handle them
> >> >> >> > without THP awareness.
> >> >> >> 
> >> >> >> I am glad if my change could help normal pages swapping too.  And we can
> >> >> >> change these functions to work for normal pages when necessary.
> >> >> >
> >> >> > Sure but it would be less painful that THP awareness swapout is
> >> >> > based on multiple normal pages swapout. For example, we don't
> >> >> > touch delay THP split part(i.e., split a THP into 512 pages like
> >> >> > as-is) and enhances swapout further like Tim's suggestion
> >> >> > for multiple normal pages swapout. With that, it might be enough
> >> >> > for fast-storage without needing THP awareness.
> >> >> >
> >> >> > My *point* is let's approach step by step.
> >> >> > First of all, go with batching normal pages swapout and if it's
> >> >> > not enough, dive into further optimization like introducing
> >> >> > THP-aware swapout.
> >> >> >
> >> >> > I believe it's natural development process to evolve things
> >> >> > without over-engineering.
> >> >> 
> >> >> My target is not only the THP swap out acceleration, but also the full
> >> >> THP swap out/in support without splitting THP.  This patchset is just
> >> >> the first step of the full THP swap support.
> >> >> 
> >> >> >> > For example, if the nr_pages is larger than SWAPFILE_CLUSTER, we
> >> >> >> > can try to allocate new cluster. With that, we could allocate new
> >> >> >> > clusters to meet nr_pages requested or bail out if we fail to allocate
> >> >> >> > and fallback to 0-order page swapout. With that, swap layer could
> >> >> >> > support multiple order-0 pages by batch.
> >> >> >> >
> >> >> >> > IMO, I really want to land Tim Chen's batching swapout work first.
> >> >> >> > With Tim Chen's work, I expect we can make better refactoring
> >> >> >> > for batching swap before adding more confuse to the swap layer.
> >> >> >> > (I expect it would share several pieces of code for or would be base
> >> >> >> > for batching allocation of swapcache, swapslot)
> >> >> >> 
> >> >> >> I don't think there is hard conflict between normal pages swapping
> >> >> >> optimizing and THP swap optimizing.  Some code may be shared between
> >> >> >> them.  That is good for both sides.
> >> >> >> 
> >> >> >> > After that, we could enhance swap for big contiguous batching
> >> >> >> > like THP and finally we might make it be aware of THP specific to
> >> >> >> > enhance further.
> >> >> >> >
> >> >> >> > A thing I remember you argued: you want to swapin 512 pages
> >> >> >> > all at once unconditionally. It's really worth to discuss if
> >> >> >> > your design is going for the way.
> >> >> >> > I doubt it's generally good idea. Because, currently, we try to
> >> >> >> > swap in swapped out pages in THP page with conservative approach
> >> >> >> > but your direction is going to opposite way.
> >> >> >> >
> >> >> >> > [mm, thp: convert from optimistic swapin collapsing to conservative]
> >> >> >> >
> >> >> >> > I think general approach(i.e., less effective than targeting
> >> >> >> > implement for your own specific goal but less hacky and better job
> >> >> >> > for many cases) is to rely/improve on the swap readahead.
> >> >> >> > If most of subpages of a THP page are really workingset, swap readahead
> >> >> >> > could work well.
> >> >> >> >
> >> >> >> > Yeah, it's fairly vague feedback so sorry if I miss something clear.
> >> >> >> 
> >> >> >> Yes.  I want to go to the direction that to swap in 512 pages together.
> >> >> >> And I think it is a good opportunity to discuss that now.  The advantages
> >> >> >> of swapping in 512 pages together are:
> >> >> >> 
> >> >> >> - Improve the performance of swapping in IO via turning small read size
> >> >> >>   into 512 pages big read size.
> >> >> >> 
> >> >> >> - Keep THP across swap out/in.  With the memory size become more and
> >> >> >>   more large, the 4k pages bring more and more burden to memory
> >> >> >>   management.  One solution is to use 2M pages as much as possible, that
> >> >> >>   will reduce the management burden greatly, such as much reduced length
> >> >> >>   of LRU list, etc.
> >> >> >> 
> >> >> >> The disadvantage are:
> >> >> >> 
> >> >> >> - Increase the memory pressure when swap in THP.
> >> >> >> 
> >> >> >> - Some pages swapped in may not needed in the near future.
> >> >> >> 
> >> >> >> Because of the disadvantages, the 512 pages swapping in should be made
> >> >> >> optional.  But I don't think we should make it impossible.
> >> >> >
> >> >> > Yeb. No need to make it impossible but your design shouldn't be coupled
> >> >> > with non-existing feature yet.
> >> >> 
> >> >> Sorry, what is the "non-existing feature"?  The full THP swap out/in
> >> >
> >> > THP swapin.
> >> >
> >> > You said you increased cluster size to fit a THP size for recording
> >> > some meta in there for THP swapin.
> >> 
> >> And to find the head of the THP to swap in the whole THP when an address
> >> in the middle of a THP is accessed.
> >> 
> >> > You gave number about how scale bad current swapout so try to enhance
> >> > that path. I agree it although I don't like your approach for first step.
> >> > However, you didn't give any clue why we should swap in a THP. How bad
> >> > current conservative swapin from khugepaged is really bad and why cannot
> >> > enhance that.
> >> >
> >> >> support without splitting THP?  If so, this patchset is just the
> >> >> first step of that.  I plan to finish the full THP swap out/in
> >> >> support in 3 steps:
> >> >> 
> >> >> 1. Delay splitting the THP after adding it into swap cache
> >> >> 
> >> >> 2. Delay splitting the THP after swapping out being completed
> >> >> 
> >> >> 3. Avoid splitting the THP during swap out, and swap in the full THP if
> >> >>    possible
> >> >> 
> >> >> I plan to do it step by step to make it easier to review the code.
> >> >
> >> > 1. If we solve batching swapout, then how is THP split for swapout bad?
> >> > 2. Also, how is current conservative swapin from khugepaged bad?
> >> >
> >> > I think it's one of decision point for the motivation of your work
> >> > and for 1, we need batching swapout feature.
> >> >
> >> > I am saying again that I'm not against your goal but only concern
> >> > is approach. If you don't agree, please ignore me.
> >> 
> >> I am glad to discuss my final goal, that is, swapping out/in the full
> >> THP without splitting.  Why I want to do that is copied as below,
> >
> > Yes, it's your *final* goal but what if it couldn't be acceptable
> > on second step you mentioned above, for example?
> >
> >         An implementation unnecessarily bound to rejected work.
> 
> So I want to discuss my final goal.  If people accept my final goal,
> this is resolved.  If people don't accept, I will reconsider it.
No.
Please keep this in mind: there are many ways a project can break
down along the way. We are human beings, so we can simply miss
something obvious and realize only later that the plan is not
feasible. Or others can show up with a better idea for the goal, or
fix another subsystem in a way that affects your goals.
I don't want to repeat such boring theoretical stuff any more.
My point is that the patchset should be self-contained if you really
want to go with a step-by-step approach, because we are likely to miss
something *easily*.
> 
> > If you want to achieve your goal step by step, please consider if
> > one of step you are thinking could be rejected but steps already
> > merged should be self-contained without side-effect.
> 
> What is the side-effect or possible regressions of the step 1 as in this
Adding code complexity for an unproven feature.
As I read your steps, your *most important* goal is to avoid splitting
and collapsing anon THP pages for swap out/in. As a bonus of that
approach, we could increase swapout/in bandwidth, too. Do I understand
correctly?
However, better swap-in/out bandwidth is a common requirement for both
normal and THP pages, and with Tim's work we could enhance the swapout
path.
So, I think you should give us numbers on how bad THP splitting is
for the swapout bandwidth even with Tim's work applied.
If it's serious, the next approach is yours: we could tweak the swap
code to be aware of a THP, to avoid splitting it.
For THP swap-in, I think it's another topic we should discuss.
Each step is orthogonal work, so it shouldn't rely on the next goal.
> patchset?  Lacks the opportunity to allocate consecutive 512 swap slots
> in 2 non-free swap clusters?  I don't think that is a regression,
> because the patchset will NOT make free swap clusters consumed faster
> than that in current code.  Even if it were better to allocate
> consecutive 512 swap slots in 2 non-free swap clusters, it could be an
> incremental improvement to the simple solution in this patchset.  That
> is, to allocate 512 swap slots, the simple solution is:
> 
> a) Try to allocate a free swap cluster
> b) If a) fails, give up
> 
> The improved solution could be (if it were needed finally)
> 
> a) Try to allocate a free swap cluster
> b) If a) fails, try to allocate consecutive 512 swap slots in 2 non-free
>    swap clusters
> c) If b) fails, give up
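For reference, the two allocation strategies quoted above can be sketched
in user-space C. This is only a toy model, not the kernel code: the
struct toy_swap map, the helper names and ALLOC_FAILED are hypothetical,
and the patch itself implements only the simple variant (via the free
cluster list, not a linear scan).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS_PER_CLUSTER 512	/* HPAGE_SIZE / PAGE_SIZE on x86_64 */
#define NR_CLUSTERS       4	/* toy swap device size (assumption) */
#define NR_SLOTS          (NR_CLUSTERS * SLOTS_PER_CLUSTER)
#define ALLOC_FAILED      ((size_t)-1)

/* Hypothetical toy swap map: 0 = free slot, non-zero = slot in use. */
struct toy_swap {
	unsigned char map[NR_SLOTS];
};

/* Return 1 if slots [start, start + n) are all free. */
static int range_is_free(const struct toy_swap *s, size_t start, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (s->map[start + i])
			return 0;
	return 1;
}

/*
 * Simple solution:
 *   a) try to allocate a wholly free, cluster-aligned swap cluster
 *   b) if a) fails, give up
 */
size_t alloc_simple(struct toy_swap *s)
{
	for (size_t c = 0; c < NR_CLUSTERS; c++) {
		size_t start = c * SLOTS_PER_CLUSTER;

		if (range_is_free(s, start, SLOTS_PER_CLUSTER)) {
			memset(s->map + start, 1, SLOTS_PER_CLUSTER);
			return start;
		}
	}
	return ALLOC_FAILED;
}

/*
 * Improved solution:
 *   a) try to allocate a free swap cluster
 *   b) if a) fails, look for 512 consecutive free slots at any offset;
 *      since no aligned cluster is free, such a run necessarily
 *      straddles two non-free clusters
 *   c) if b) fails, give up
 */
size_t alloc_improved(struct toy_swap *s)
{
	size_t start = alloc_simple(s);

	if (start != ALLOC_FAILED)
		return start;
	for (start = 0; start + SLOTS_PER_CLUSTER <= NR_SLOTS; start++) {
		if (range_is_free(s, start, SLOTS_PER_CLUSTER)) {
			memset(s->map + start, 1, SLOTS_PER_CLUSTER);
			return start;
		}
	}
	return ALLOC_FAILED;
}
```

In this model the improved variant only ever succeeds where the simple
one fails, which is the sense in which it is an incremental addition.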
I didn't mean it. Please read above.
> 
> > If it's hard, send full patchset all at once so reviewers can think
> > what you want of right direction and implementation is good for it.
> 
> Thanks for suggestion.
Huang,
I'm sorry if I misunderstood something. I should also admit that I'm not
even a THP user, so I'm blind to THP workloads; sorry, too, if I am missing
something obvious. However, my concern is adding more complexity to the
swap layer without justification, and to me it's really hard to understand
your motivation from your description.
If you want a step-by-step approach, then for the first step please prove
how bad the THP split is in the swapout path. It would also be better to
consider how to share code with normal page batching: with THP awareness
built on top of normal page batching, it would be easier to prove and
review, I think.
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-13 23:52               ` Chen, Tim C
@ 2016-09-19  7:11                 ` Minchan Kim
  2016-09-19 15:59                   ` Tim Chen
  0 siblings, 1 reply; 60+ messages in thread
From: Minchan Kim @ 2016-09-19  7:11 UTC (permalink / raw)
  To: Chen, Tim C
  Cc: Huang, Ying, Andrew Morton, Hansen, Dave, Kleen, Andi, Lu, Aaron,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins,
	Shaohua Li, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Hi Tim,
On Tue, Sep 13, 2016 at 11:52:27PM +0000, Chen, Tim C wrote:
> >>
> >> - Avoid CPU time for splitting, collapsing THP across swap out/in.
> >
> >Yes, if you want, please give us how bad it is.
> >
> 
> It could be pretty bad.  In an experiment with THP turned on and we
> enter swap, 50% of the cpu are spent in the page compaction path.  
It's page compaction overhead, especially, pageblock_pfn_to_page.
Why is it related to overhead THP split for swapout?
I don't understand.
> So if we could deal with units of large page for swap, the splitting
> and compaction of ordinary pages to large page overhead could be avoided.
> 
>    51.89%    51.89%            :1688  [kernel.kallsyms]   [k] pageblock_pfn_to_page                       
>                       |
>                       --- pageblock_pfn_to_page
>                          |          
>                          |--64.57%-- compaction_alloc
>                          |          migrate_pages
>                          |          compact_zone
>                          |          compact_zone_order
>                          |          try_to_compact_pages
>                          |          __alloc_pages_direct_compact
>                          |          __alloc_pages_nodemask
>                          |          alloc_pages_vma
>                          |          do_huge_pmd_anonymous_page
>                          |          handle_mm_fault
>                          |          __do_page_fault
>                          |          do_page_fault
>                          |          page_fault
>                          |          0x401d9a
>                          |          
>                          |--34.62%-- compact_zone
>                          |          compact_zone_order
>                          |          try_to_compact_pages
>                          |          __alloc_pages_direct_compact
>                          |          __alloc_pages_nodemask
>                          |          alloc_pages_vma
>                          |          do_huge_pmd_anonymous_page
>                          |          handle_mm_fault
>                          |          __do_page_fault
>                          |          do_page_fault
>                          |          page_fault
>                          |          0x401d9a
>                           --0.81%-- [...]
> 
> Tim
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-19  7:11                 ` Minchan Kim
@ 2016-09-19 15:59                   ` Tim Chen
  0 siblings, 0 replies; 60+ messages in thread
From: Tim Chen @ 2016-09-19 15:59 UTC (permalink / raw)
  To: Minchan Kim, Chen, Tim C
  Cc: Huang, Ying, Andrew Morton, Hansen, Dave, Kleen, Andi, Lu, Aaron,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins,
	Shaohua Li, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
On Mon, 2016-09-19 at 16:11 +0900, Minchan Kim wrote:
> Hi Tim,
> 
> On Tue, Sep 13, 2016 at 11:52:27PM +0000, Chen, Tim C wrote:
> > 
> > > 
> > > > 
> > > > 
> > > > - Avoid CPU time for splitting, collapsing THP across swap out/in.
> > > Yes, if you want, please give us how bad it is.
> > > 
> > It could be pretty bad.  In an experiment with THP turned on and we
> > enter swap, 50% of the cpu are spent in the page compaction path.  
> It's page compaction overhead, especially, pageblock_pfn_to_page.
> Why is it related to overhead THP split for swapout?
> I don't understand.
Today you have to split a large page into 4K pages to swap it out.
Then after you swap in all the 4K pages, you have to re-compact
them back into a large page.
If you can swap the large page out as a contiguous unit, and swap
it back in as a single large page, the splitting and re-compaction
back into a large page can be avoided.
Tim
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-08  5:45   ` Anshuman Khandual
  2016-09-08 18:07     ` Huang, Ying
@ 2016-09-19 17:09     ` Johannes Weiner
  2016-09-20  2:01       ` Huang, Ying
  1 sibling, 1 reply; 60+ messages in thread
From: Johannes Weiner @ 2016-09-19 17:09 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel
On Thu, Sep 08, 2016 at 11:15:52AM +0530, Anshuman Khandual wrote:
> On 09/07/2016 10:16 PM, Huang, Ying wrote:
> > From: Huang Ying <ying.huang@intel.com>
> > 
> > In this patch, the size of the swap cluster is changed to that of the
> > THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
> > the THP swap support on x86_64.  Where one swap cluster will be used to
> > hold the contents of each THP swapped out.  And some information of the
> > swapped out THP (such as compound map count) will be recorded in the
> > swap_cluster_info data structure.
> > 
> > For other architectures which want THP swap support, THP_SWAP_CLUSTER
> > need to be selected in the Kconfig file for the architecture.
> > 
> > In effect, this will enlarge swap cluster size by 2 times on x86_64.
> > Which may make it harder to find a free cluster when the swap space
> > becomes fragmented.  So that, this may reduce the continuous swap space
> > allocation and sequential write in theory.  The performance test in 0day
> > shows no regressions caused by this.
> 
> This patch needs to be split into two separate ones
> 
> (1) Add THP_SWAP_CLUSTER config option
> (2) Enable CONFIG_THP_SWAP_CLUSTER for X86_64
No, don't do that. This is a bit of an anti-pattern in this series,
where it introduces a thing in one patch, and a user for it in a later
patch. However, in order to judge whether that thing is good or not, I
need to know how exactly it's being used.
So, please, split your series into logical steps, not geographical
ones. When you introduce a function, config option, symbol, add it
along with the code that actually *uses* it, in the same patch.
It goes for this patch, but also stuff like the memcg accounting
functions, get_huge_swap_page() etc.
Start with the logical change, then try to isolate independent changes
that could make sense even without the rest of the series. If that
results in a large patch, then so be it. If a big change is hard to
review, then making me switch back and forth between emails will make
it harder, not easier, to make sense of it.
Thanks
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (10 preceding siblings ...)
  2016-09-09  5:43 ` [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Minchan Kim
@ 2016-09-19 17:33 ` Hugh Dickins
  2016-09-22 22:56 ` Shaohua Li
  12 siblings, 0 replies; 60+ messages in thread
From: Hugh Dickins @ 2016-09-19 17:33 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
On Wed, 7 Sep 2016, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> This patchset is to optimize the performance of Transparent Huge Page
> (THP) swap.
> 
> Hi, Andrew, could you help me to check whether the overall design is
> reasonable?
> 
> Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
> swap part of the patchset?  Especially [01/10], [04/10], [05/10],
> [06/10], [07/10], [10/10].
Sorry, I am very far from having time to do so.
Hugh
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-19 17:09     ` Johannes Weiner
@ 2016-09-20  2:01       ` Huang, Ying
  2016-09-22 19:25         ` Johannes Weiner
  0 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-20  2:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Anshuman Khandual, Huang, Ying, Andrew Morton, tim.c.chen,
	dave.hansen, andi.kleen, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel
Hi, Johannes,
Johannes Weiner <hannes@cmpxchg.org> writes:
> On Thu, Sep 08, 2016 at 11:15:52AM +0530, Anshuman Khandual wrote:
>> On 09/07/2016 10:16 PM, Huang, Ying wrote:
>> > From: Huang Ying <ying.huang@intel.com>
>> > 
>> > In this patch, the size of the swap cluster is changed to that of the
>> > THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
>> > the THP swap support on x86_64.  Where one swap cluster will be used to
>> > hold the contents of each THP swapped out.  And some information of the
>> > swapped out THP (such as compound map count) will be recorded in the
>> > swap_cluster_info data structure.
>> > 
>> > For other architectures which want THP swap support, THP_SWAP_CLUSTER
>> > need to be selected in the Kconfig file for the architecture.
>> > 
>> > In effect, this will enlarge swap cluster size by 2 times on x86_64.
>> > Which may make it harder to find a free cluster when the swap space
>> > becomes fragmented.  So that, this may reduce the continuous swap space
>> > allocation and sequential write in theory.  The performance test in 0day
>> > shows no regressions caused by this.
>> 
>> This patch needs to be split into two separate ones
>> 
>> (1) Add THP_SWAP_CLUSTER config option
>> (2) Enable CONFIG_THP_SWAP_CLUSTER for X86_64
>
> No, don't do that. This is a bit of an anti-pattern in this series,
> where it introduces a thing in one patch, and a user for it in a later
> patch. However, in order to judge whether that thing is good or not, I
> need to know how exactly it's being used.
>
> So, please, split your series into logical steps, not geographical
> ones. When you introduce a function, config option, symbol, add it
> along with the code that actually *uses* it, in the same patch.
>
> It goes for this patch, but also stuff like the memcg accounting
> functions, get_huge_swap_page() etc.
>
> Start with the logical change, then try to isolate independent changes
> that could make sense even without the rest of the series. If that
> results in a large patch, then so be it. If a big change is hard to
> review, then making me switch back and forth between emails will make
> it harder, not easier, to make sense of it.
It appears that all patches other than [10/10] in the series are used by
the last patch [10/10], directly or indirectly.  And without [10/10], they
don't make much sense.  So you suggest that I use one large patch?
Something like below?  Does that help you to review?
If other reviewers think this helps them to review the code too, I will
send out a formal new version with a better patch description.
Best Regards,
Huang, Ying
----------------------------------------------------------->
This patch is to optimize the performance of Transparent Huge Page
(THP) swap.
Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth when swapping pages out, even on a
high-end server machine, because storage device performance has
improved faster than that of the CPU.  And it seems that the trend
will not change in the near future.  On the other hand, THP is
becoming more and more popular because of increased memory size.  So it
becomes necessary to optimize THP swap performance.
The advantages of the THP swap support include:
- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of the THP swap.
- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which is usually 4k random
  IO.  This will improve the performance of the THP swap too.
- It will help reduce memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M of contiguous pages will be
  freed up after the THP is swapped out.
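The batching argument in the list above comes down to simple arithmetic,
sketched below.  The TOY_* constants assume x86_64 sizes (4k base pages,
2M THP), and the "one lock round trip per allocation" cost model is a
deliberate simplification for illustration, not a measurement of the
kernel.

```c
#include <assert.h>

#define TOY_PAGE_SIZE	(4UL * 1024)		/* 4k base page (x86_64) */
#define TOY_HPAGE_SIZE	(2UL * 1024 * 1024)	/* 2M THP (x86_64) */

/* Number of base pages, and hence swap slots, covered by one THP. */
unsigned long thp_nr_subpages(void)
{
	return TOY_HPAGE_SIZE / TOY_PAGE_SIZE;
}

/*
 * Lock acquire/release round trips needed to allocate swap space for
 * nr_thp huge pages, assuming one round trip per allocation: per 4k
 * page when the THP is split first, per THP when it is handled whole.
 */
unsigned long lock_round_trips(unsigned long nr_thp, int batched)
{
	return batched ? nr_thp : nr_thp * thp_nr_subpages();
}
```

Under these assumptions one THP covers 512 swap slots, which matches the
cluster size the patch selects with HPAGE_SIZE / PAGE_SIZE.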
This patch is based on 8/31 head of mmotm/master.
This patch is the first step for the THP swap support.  The plan is
to delay splitting the THP step by step, and finally to avoid splitting
the THP during swap out altogether and swap the THP out/in as a whole.
As the first step, in this patch, splitting the huge page is
delayed from almost the first step of swapping out to after allocating
the swap space for the THP and adding the THP into the swap cache.
This will reduce lock acquiring/releasing for the locks used for the
swap cache management.
With the patch, the swap out throughput improves 12.1% (from about
1.12GB/s to about 1.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To
test the sequential swapping out, the test case uses 16 processes,
which sequentially allocate and write to the anonymous pages until the
RAM and part of the swap device is used up.
The detailed comparison results are as follows:
base             base+patch
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
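The %change column in the table above can be reproduced from the base and
base+patch columns.  The helper below is a minimal sketch; the name
percent_change is illustrative and is not part of the 0day tooling.

```c
#include <assert.h>

/* Relative change from base to patched, in percent. */
double percent_change(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```

For example, percent_change(1118821, 1254241) is roughly 12.1, which
agrees with the vmstat.swap.so row, and percent_change(308.79, 284.53)
is roughly -7.9, which agrees with the elapsed_time row.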
From the swap out throughput number, we can see that, even when tested
on a RAM simulated PMEM (Persistent Memory) device, the swap out
throughput can reach only about 1.1GB/s.  In the file IO test, by
contrast, the sequential write throughput of an Intel P3700 SSD can
reach about 1.8GB/s steadily.  And according to the following URL,
https://www-ssl.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html
the sequential write throughput of the Intel P3608 SSD can reach about
3.0GB/s, while the random read IOPS can reach about 850k.  It is clear
that the bottleneck has moved from the disk to the kernel swap
component itself.
The improved storage device performance should have made swap a
better feature than before, with better performance.  But
because of the issues in the kernel swap component itself, swap
performance is still kept at a low level.  That prevents the swap
feature from being used by more users.  And this in turn means that few
kernel developers think it is necessary to optimize the kernel swap
component.  To break the loop, we need to optimize the performance of the
kernel swap component.  Optimizing the THP swap performance is part of it.
Changelog:
v3:
- Per Andrew's suggestion, used a more systematic way to determine
  whether to enable THP swap optimization
- Per Andrew's comments, moved as much code as possible into
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE/#endif or "if (PageTransHuge())"
- Fixed some coding style warnings.
v2:
- Original [1/11] sent separately and merged
- Use switch in 10/10 per Hiff's suggestion
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 arch/x86/Kconfig            |    1 
 include/linux/huge_mm.h     |    6 +
 include/linux/page-flags.h  |    2 
 include/linux/swap.h        |   45 ++++++-
 include/linux/swap_cgroup.h |    6 -
 mm/Kconfig                  |   13 ++
 mm/huge_memory.c            |   26 +++-
 mm/memcontrol.c             |   55 +++++----
 mm/shmem.c                  |    2 
 mm/swap_cgroup.c            |   78 ++++++++++---
 mm/swap_state.c             |  124 +++++++++++++++++----
 mm/swapfile.c               |  259 ++++++++++++++++++++++++++++++++------------
 12 files changed, 471 insertions(+), 146 deletions(-)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -164,6 +164,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -503,6 +503,19 @@ config FRONTSWAP
 
 	  If unsure, say Y to enable frontswap.
 
+config ARCH_USES_THP_SWAP_CLUSTER
+	bool
+	default n
+
+config THP_SWAP_CLUSTER
+	bool
+	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
+	default y
+	help
+	  Use one swap cluster to hold the contents of the THP
+	  (Transparent Huge Page) swapped out.  The size of the swap
+	  cluster will be same as that of THP.
+
 config CMA
 	bool "Contiguous Memory Allocator"
 	depends on HAVE_MEMBLOCK && MMU
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -196,7 +196,11 @@ static void discard_swap_cluster(struct
 	}
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
@@ -322,6 +326,14 @@ static void swap_cluster_schedule_discar
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -341,8 +353,7 @@ static void swap_do_scheduled_discard(st
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 	}
@@ -359,6 +370,34 @@ static void swap_discard_work(struct wor
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -370,11 +409,8 @@ static void inc_cluster_info_page(struct
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -398,21 +434,8 @@ static void dec_cluster_info_page(struct
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -493,6 +516,69 @@ new_cluster:
 	*scan_base = tmp;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline unsigned int huge_cluster_nr_entries(bool huge)
+{
+	return huge ? SWAPFILE_CLUSTER : 1;
+}
+#else
+#define huge_cluster_nr_entries(huge)	1
+#endif
+
+static void __swap_entry_alloc(struct swap_info_struct *si,
+			       unsigned long offset, bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void __swap_entry_free(struct swap_info_struct *si, unsigned long offset,
+			      bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify =
+			si->bdev->bd_disk->fops->swap_slot_free_notify;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -587,18 +673,7 @@ checks:
 	if (si->swap_map[offset])
 		goto scan;
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	__swap_entry_alloc(si, offset, false);
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	si->cluster_next = offset + 1;
@@ -645,14 +720,80 @@ no_page:
 	return 0;
 }
 
-swp_entry_t get_swap_page(void)
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static void swap_free_huge_cluster(struct swap_info_struct *si,
+				   unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	__swap_entry_free(si, offset, true);
+}
+
+/*
+ * Caller should hold si->lock.
+ */
+static void swapcache_free_trans_huge(struct swap_info_struct *si,
+				      swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	unsigned char *map;
+	unsigned int i;
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] &= ~SWAP_HAS_CACHE;
+	}
+	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+	swap_free_huge_cluster(si, idx);
+}
+
+static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+	idx = cluster_list_first(&si->free_clusters);
+	alloc_cluster(si, idx);
+	ci = si->cluster_info + idx;
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+	offset = idx * SWAPFILE_CLUSTER;
+	__swap_entry_alloc(si, offset, true);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	return offset;
+}
+#else
+static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
+{
+	return 0;
+}
+
+static inline void swapcache_free_trans_huge(struct swap_info_struct *si,
+					     swp_entry_t entry)
+{
+}
+#endif
+
+swp_entry_t __get_swap_page(bool huge)
 {
 	struct swap_info_struct *si, *next;
 	pgoff_t offset;
+	int nr_pages = huge_cluster_nr_entries(huge);
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	if (atomic_long_read(&nr_swap_pages) < nr_pages)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+	atomic_long_sub(nr_pages, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -680,10 +821,15 @@ start_over:
 		}
 
 		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		if (likely(nr_pages == 1))
+			offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		else
+			offset = swap_alloc_huge_cluster(si);
 		spin_unlock(&si->lock);
 		if (offset)
 			return swp_entry(si->type, offset);
+		else if (unlikely(nr_pages != 1))
+			goto fail_alloc;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 		       si->type);
 		spin_lock(&swap_avail_lock);
@@ -703,8 +849,8 @@ nextsi:
 	}
 
 	spin_unlock(&swap_avail_lock);
-
-	atomic_long_inc(&nr_swap_pages);
+fail_alloc:
+	atomic_long_add(nr_pages, &nr_swap_pages);
 noswap:
 	return (swp_entry_t) {0};
 }
@@ -802,31 +948,9 @@ static unsigned char swap_entry_free(str
 
 	/* free if no reference */
 	if (!usage) {
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, 1);
 		dec_cluster_info_page(p, p->cluster_info, offset);
-		if (offset < p->lowest_bit)
-			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
-			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
-		}
-		atomic_long_inc(&nr_swap_pages);
-		p->inuse_pages--;
-		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
-		}
+		__swap_entry_free(p, offset, false);
 	}
 
 	return usage;
@@ -850,13 +974,16 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry, bool huge)
 {
 	struct swap_info_struct *p;
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
+		if (unlikely(huge))
+			swapcache_free_trans_huge(p, entry);
+		else
+			swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -18,6 +18,13 @@ struct swap_cgroup {
 };
 #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
 
+struct swap_cgroup_iter {
+	struct swap_cgroup_ctrl *ctrl;
+	struct swap_cgroup *sc;
+	swp_entry_t entry;
+	unsigned long flags;
+};
+
 /*
  * SwapCgroup implements "lookup" and "exchange" operations.
  * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
@@ -75,6 +82,35 @@ static struct swap_cgroup *lookup_swap_c
 	return sc + offset % SC_PER_PAGE;
 }
 
+static void swap_cgroup_iter_init(struct swap_cgroup_iter *iter,
+				  swp_entry_t ent)
+{
+	iter->entry = ent;
+	iter->sc = lookup_swap_cgroup(ent, &iter->ctrl);
+	spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
+}
+
+static void swap_cgroup_iter_exit(struct swap_cgroup_iter *iter)
+{
+	spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
+}
+
+/*
+ * swap_cgroup is stored in a kind of discontinuous array.  That is,
+ * they are continuous in one page, but not across page boundary.  And
+ * there is one lock for each page.
+ */
+static void swap_cgroup_iter_advance(struct swap_cgroup_iter *iter)
+{
+	iter->sc++;
+	iter->entry.val++;
+	if (!(((unsigned long)iter->sc) & PAGE_MASK)) {
+		spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
+		iter->sc = lookup_swap_cgroup(iter->entry, &iter->ctrl);
+		spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
+	}
+}
+
 /**
  * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
  * @ent: swap entry to be cmpxchged
@@ -87,45 +123,49 @@ static struct swap_cgroup *lookup_swap_c
 unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new)
 {
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
-	unsigned long flags;
+	struct swap_cgroup_iter iter;
 	unsigned short retval;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	swap_cgroup_iter_init(&iter, ent);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	retval = sc->id;
+	retval = iter.sc->id;
 	if (retval == old)
-		sc->id = new;
+		iter.sc->id = new;
 	else
 		retval = 0;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+
+	swap_cgroup_iter_exit(&iter);
 	return retval;
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for this swp_entry.
- * @ent: swap entry to be recorded into
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * @ent: the first swap entry to be recorded into
  * @id: mem_cgroup to be recorded
+ * @nr_ents: number of swap entries to be recorded
  *
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
+	struct swap_cgroup_iter iter;
 	unsigned short old;
-	unsigned long flags;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	swap_cgroup_iter_init(&iter, ent);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	old = sc->id;
-	sc->id = id;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+	old = iter.sc->id;
+	for (;;) {
+		VM_BUG_ON(iter.sc->id != old);
+		iter.sc->id = id;
+		nr_ents--;
+		if (!nr_ents)
+			break;
+		swap_cgroup_iter_advance(&iter);
+	}
 
+	swap_cgroup_iter_exit(&iter);
 	return old;
 }
 
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -399,14 +399,14 @@ static inline long get_nr_swap_pages(voi
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t __get_swap_page(bool huge);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t, bool);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -419,6 +419,23 @@ extern bool reuse_swap_page(struct page
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
+static inline swp_entry_t get_swap_page(void)
+{
+	return __get_swap_page(false);
+}
+
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return __get_swap_page(true);
+}
+#else
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+#endif
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
@@ -461,7 +478,7 @@ static inline void swap_free(swp_entry_t
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void __swapcache_free(swp_entry_t swp, bool huge)
 {
 }
 
@@ -525,8 +542,18 @@ static inline swp_entry_t get_swap_page(
 	return entry;
 }
 
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+
 #endif /* CONFIG_SWAP */
 
+static inline void swapcache_free(swp_entry_t entry)
+{
+	__swapcache_free(entry, false);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
@@ -550,8 +577,10 @@ static inline int mem_cgroup_swappiness(
 
 #ifdef CONFIG_MEMCG_SWAP
 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
-extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
-extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+				      unsigned int nr_entries);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry,
+				     unsigned int nr_entries);
 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
 extern bool mem_cgroup_swap_full(struct page *page);
 #else
@@ -560,12 +589,14 @@ static inline void mem_cgroup_swapout(st
 }
 
 static inline int mem_cgroup_try_charge_swap(struct page *page,
-					     swp_entry_t entry)
+					     swp_entry_t entry,
+					     unsigned int nr_entries)
 {
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
+					    unsigned int nr_entries)
 {
 }
 
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
+extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,7 +16,8 @@ extern void swap_cgroup_swapoff(int type
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	return 0;
 }
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2370,10 +2370,9 @@ void mem_cgroup_split_huge_fixup(struct
 
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+				       int nr_entries)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_entries);
 }
 
 /**
@@ -2399,8 +2398,8 @@ static int mem_cgroup_move_swap_account(
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -5417,7 +5416,7 @@ void mem_cgroup_commit_charge(struct pag
 		 * let's not wait for it.  The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
@@ -5825,9 +5824,9 @@ void mem_cgroup_swapout(struct page *pag
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(swap_memcg, true);
+	mem_cgroup_swap_statistics(swap_memcg, 1);
 
 	page->mem_cgroup = NULL;
 
@@ -5854,16 +5853,19 @@ void mem_cgroup_swapout(struct page *pag
 		css_put(&memcg->css);
 }
 
-/*
- * mem_cgroup_try_charge_swap - try charging a swap entry
+/**
+ * mem_cgroup_try_charge_swap - try charging a set of swap entries
  * @page: page being added to swap
- * @entry: swap entry to charge
+ * @entry: the first swap entry to charge
+ * @nr_entries: the number of swap entries to charge
  *
- * Try to charge @entry to the memcg that @page belongs to.
+ * Try to charge @nr_entries swap entries starting from @entry to the
+ * memcg that @page belongs to.
  *
  * Returns 0 on success, -ENOMEM on failure.
  */
-int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
+int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+			       unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	struct page_counter *counter;
@@ -5881,25 +5883,29 @@ int mem_cgroup_try_charge_swap(struct pa
 	memcg = mem_cgroup_id_get_online(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
-	    !page_counter_try_charge(&memcg->swap, 1, &counter)) {
+	    !page_counter_try_charge(&memcg->swap, nr_entries, &counter)) {
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
 
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	if (nr_entries > 1)
+		mem_cgroup_id_get_many(memcg, nr_entries - 1);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_entries);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
+	mem_cgroup_swap_statistics(memcg, nr_entries);
 
 	return 0;
 }
 
 /**
- * mem_cgroup_uncharge_swap - uncharge a swap entry
- * @entry: swap entry to uncharge
+ * mem_cgroup_uncharge_swap - uncharge a set of swap entries
+ * @entry: the first swap entry to uncharge
+ * @nr_entries: the number of swap entries to uncharge
  *
- * Drop the swap charge associated with @entry.
+ * Drop the swap charge associated with @nr_entries swap entries
+ * starting from @entry.
  */
-void mem_cgroup_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	unsigned short id;
@@ -5907,17 +5913,18 @@ void mem_cgroup_uncharge_swap(swp_entry_
 	if (!do_swap_account)
 		return;
 
-	id = swap_cgroup_record(entry, 0);
+	id = swap_cgroup_record(entry, 0, nr_entries);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
-				page_counter_uncharge(&memcg->swap, 1);
+				page_counter_uncharge(&memcg->swap, nr_entries);
 			else
-				page_counter_uncharge(&memcg->memsw, 1);
+				page_counter_uncharge(&memcg->memsw,
+						      nr_entries);
 		}
-		mem_cgroup_swap_statistics(memcg, false);
+		mem_cgroup_swap_statistics(memcg, -nr_entries);
 		mem_cgroup_id_put(memcg);
 	}
 	rcu_read_unlock();
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1248,7 +1248,7 @@ static int shmem_writepage(struct page *
 	if (!swap.val)
 		goto redirty;
 
-	if (mem_cgroup_try_charge_swap(page, swap))
+	if (mem_cgroup_try_charge_swap(page, swap, 1))
 		goto free_swap;
 
 	/*
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/huge_mm.h>
 
 #include <asm/pgtable.h>
 
@@ -43,6 +44,7 @@ struct address_space swapper_spaces[MAX_
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
 	unsigned long add_total;
@@ -80,25 +82,32 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-	int error;
+	int error, i, nr = hpage_nr_pages(page);
 	struct address_space *address_space;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	get_page(page);
+	page_ref_add(page, nr);
 	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
 
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-					entry.val, page);
+	for (i = 0; i < nr; i++) {
+		struct page *cur_page = page + i;
+		unsigned long index = entry.val + i;
+
+		set_page_private(cur_page, index);
+		error = radix_tree_insert(&address_space->page_tree,
+					  index, cur_page);
+		if (unlikely(error))
+			break;
+	}
 	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_node_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
+		address_space->nrpages += nr;
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		ADD_CACHE_INFO(add_total, nr);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
 
@@ -109,9 +118,16 @@ int __add_to_swap_cache(struct page *pag
 		 * So add_to_swap_cache() doesn't returns -EEXIST.
 		 */
 		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
 		ClearPageSwapCache(page);
-		put_page(page);
+		set_page_private(page + i, 0UL);
+		while (i--) {
+			struct page *cur_page = page + i;
+			unsigned long index = entry.val + i;
+
+			set_page_private(cur_page, 0UL);
+			radix_tree_delete(&address_space->page_tree, index);
+		}
+		page_ref_sub(page, nr);
 	}
 
 	return error;
@@ -122,7 +138,7 @@ int add_to_swap_cache(struct page *page,
 {
 	int error;
 
-	error = radix_tree_maybe_preload(gfp_mask);
+	error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
@@ -138,6 +154,7 @@ void __delete_from_swap_cache(struct pag
 {
 	swp_entry_t entry;
 	struct address_space *address_space;
+	int i, nr = hpage_nr_pages(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -145,20 +162,66 @@ void __delete_from_swap_cache(struct pag
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, page_private(page));
-	set_page_private(page, 0);
 	ClearPageSwapCache(page);
-	address_space->nrpages--;
-	__dec_node_page_state(page, NR_FILE_PAGES);
-	INC_CACHE_INFO(del_total);
+	for (i = 0; i < nr; i++) {
+		struct page *cur_page = page + i;
+
+		radix_tree_delete(&address_space->page_tree,
+				  page_private(cur_page));
+		set_page_private(cur_page, 0);
+	}
+	address_space->nrpages -= nr;
+	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+	ADD_CACHE_INFO(del_total, nr);
+}
+
+#ifdef CONFIG_THP_SWAP_CLUSTER
+int add_to_swap_trans_huge(struct page *page, struct list_head *list)
+{
+	swp_entry_t entry;
+	int ret = 0;
+
+	/* cannot be split (may be needed during swap in), so skip it */
+	if (!can_split_huge_page(page))
+		return -EBUSY;
+	/* fall back to splitting the huge page first if no PMD map */
+	if (!compound_mapcount(page))
+		return 0;
+	entry = get_huge_swap_page();
+	if (!entry.val)
+		return 0;
+	if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) {
+		__swapcache_free(entry, true);
+		return -EOVERFLOW;
+	}
+	ret = add_to_swap_cache(page, entry,
+				__GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN);
+	/* -ENOMEM radix-tree allocation failure */
+	if (ret) {
+		__swapcache_free(entry, true);
+		return 0;
+	}
+	ret = split_huge_page_to_list(page, list);
+	if (ret) {
+		delete_from_swap_cache(page);
+		return -EBUSY;
+	}
+	return 1;
+}
+#else
+static inline int add_to_swap_trans_huge(struct page *page,
+					 struct list_head *list)
+{
+	return 0;
 }
+#endif
 
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate swap space for the page and add the page to the
- * swap cache.  Caller needs to hold the page lock. 
+ * swap cache.  Caller needs to hold the page lock.
  */
 int add_to_swap(struct page *page, struct list_head *list)
 {
@@ -168,11 +231,23 @@ int add_to_swap(struct page *page, struc
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
+	if (unlikely(PageTransHuge(page))) {
+		err = add_to_swap_trans_huge(page, list);
+		switch (err) {
+		case 1:
+			return 1;
+		case 0:
+			/* on 0, fall back to splitting the page first */
+			break;
+		default:
+			return 0;
+		}
+	}
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
 
-	if (mem_cgroup_try_charge_swap(page, entry)) {
+	if (mem_cgroup_try_charge_swap(page, entry, 1)) {
 		swapcache_free(entry);
 		return 0;
 	}
@@ -227,8 +302,8 @@ void delete_from_swap_cache(struct page
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry);
-	put_page(page);
+	__swapcache_free(entry, PageTransHuge(page));
+	page_ref_sub(page, hpage_nr_pages(page));
 }
 
 /* 
@@ -285,7 +360,7 @@ struct page * lookup_swap_cache(swp_entr
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page) {
+	if (page && likely(!PageTransCompound(page))) {
 		INC_CACHE_INFO(find_success);
 		if (TestClearPageReadahead(page))
 			atomic_inc(&swapin_readahead_hits);
@@ -311,8 +386,13 @@ struct page *__read_swap_cache_async(swp
 		 * that would confuse statistics.
 		 */
 		found_page = find_get_page(swapper_space, entry.val);
-		if (found_page)
+		if (found_page) {
+			if (unlikely(PageTransCompound(found_page))) {
+				put_page(found_page);
+				found_page = NULL;
+			}
 			break;
+		}
 
 		/*
 		 * Get a new page to read into from swap.
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_ar
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
+bool can_split_huge_page(struct page *page);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -176,6 +177,11 @@ static inline void prep_transhuge_page(s
 
 #define thp_get_unmapped_area	NULL
 
+static inline bool
+can_split_huge_page(struct page *page)
+{
+	return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1834,7 +1834,7 @@ static void __split_huge_page_tail(struc
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -1845,6 +1845,7 @@ static void __split_huge_page_tail(struc
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -1907,7 +1908,11 @@ static void __split_huge_page(struct pag
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -2016,6 +2021,19 @@ int page_trans_huge_mapcount(struct page
 	return ret;
 }
 
+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page)
+{
+	int extra_pins;
+
+	/* Additional pins from radix tree */
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
+		extra_pins = HPAGE_PMD_NR;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2064,7 +2082,7 @@ int split_huge_page_to_list(struct page
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
+		extra_pins = PageSwapCache(head) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
@@ -2086,7 +2104,7 @@ int split_huge_page_to_list(struct page
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+	if (!can_split_huge_page(head)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
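
For readers following the swap_cgroup changes in the patch above: the
sketch below is a small user-space model, not kernel code, of how
swap_cgroup_record() walks @nr_ents entries that are contiguous only
within one page.  The names model_pages, model_lookup and model_record
are made up for illustration, the per-page locking is left out, and the
page-boundary test uses the entry offset instead of the pointer value.
That last point is an assumption of this sketch, not what the patch
does.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE	4096u
/* Number of 16-bit ids that fit in one model page (2048 here). */
#define SC_PER_PAGE	(MODEL_PAGE_SIZE / sizeof(unsigned short))

/* Three separate "pages": ids are contiguous only inside each of them. */
static unsigned short chunk_a[SC_PER_PAGE];
static unsigned short chunk_b[SC_PER_PAGE];
static unsigned short chunk_c[SC_PER_PAGE];
static unsigned short *model_pages[] = { chunk_a, chunk_b, chunk_c };

/* Hypothetical stand-in for lookup_swap_cgroup(): offset -> slot pointer. */
static unsigned short *model_lookup(size_t offset)
{
	return model_pages[offset / SC_PER_PAGE] + offset % SC_PER_PAGE;
}

/*
 * Store @id into @nr_ents consecutive slots starting at @offset and
 * return the previous id.  All slots are expected to hold the same old
 * id, and @nr_ents must be at least 1, as in the patch.
 */
static unsigned short model_record(size_t offset, unsigned short id,
				   unsigned int nr_ents)
{
	unsigned short *sc = model_lookup(offset);
	unsigned short old = *sc;

	for (;;) {
		assert(*sc == old);	/* mirrors the VM_BUG_ON() in the patch */
		*sc = id;
		if (--nr_ents == 0)
			break;
		offset++;
		if (offset % SC_PER_PAGE)
			sc++;				/* same page */
		else
			sc = model_lookup(offset);	/* next page */
	}
	return old;
}
```

Recording 4 entries that start 2 slots before the end of the first
chunk writes the last 2 slots of that chunk and the first 2 slots of
the next one, which is the case the iterator has to get right for a
512-entry THP.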
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-19  7:08                 ` Minchan Kim
@ 2016-09-20  2:54                   ` Huang, Ying
  2016-09-20  5:06                     ` Minchan Kim
  0 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-20  2:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Hi, Minchan,
Minchan Kim <minchan@kernel.org> writes:
> Hi Huang,
>
> On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >> 
>> >> >> > Hi Huang,
>> >> >> >
>> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >> >> >
[snip]
>> >> > 1. If we solve batching swapout, then how is THP split for swapout bad?
>> >> > 2. Also, how is current conservative swapin from khugepaged bad?
>> >> >
>> >> > I think it's one of decision point for the motivation of your work
>> >> > and for 1, we need batching swapout feature.
>> >> >
>> >> > I am saying again that I'm not against your goal but only concern
>> >> > is approach. If you don't agree, please ignore me.
>> >> 
>> >> I am glad to discuss my final goal, that is, swapping out/in the full
>> >> THP without splitting.  Why I want to do that is copied as below,
>> >
>> > Yes, it's your *final* goal but what if it couldn't be acceptable
>> > on second step you mentioned above, for example?
>> >
>> >         Unnecessary binded implementation to rejected work.
>> 
>> So I want to discuss my final goal.  If people accept my final goal,
>> this is resolved.  If people don't accept, I will reconsider it.
>
> No.
>
> Please keep it in mind. There are lots of factors the project would
> be broken during going on by several reasons because we are human being
> so we can simply miss something clear and realize it later that it's
> not feasible. Otherwise, others can show up with better idea for the
> goal or fix other subsystem which can affect your goals.
> I don't want to say such boring theoretical stuffs any more.
>
> My point is patchset should be self-contained if you really want to go
> with step-by-step approach because we are likely to miss something
> *easily*.
>
>> 
>> > If you want to achieve your goal step by step, please consider if
>> > one of step you are thinking could be rejected but steps already
>> > merged should be self-contained without side-effect.
>> 
>> What is the side-effect or possible regressions of the step 1 as in this
>
> Adding code complexity for unproved feature.
>
> When I read your steps, your *most important* goal is to avoid split/
> collapsing anon THP page for swap out/in. As a bonus with the approach,
> we could increase swapout/in bandwidth, too. Do I understand correctly?
It's hard to say which goal is the *most important*.  But it is clear
that improving swapout/in performance isn't the only goal.  The other
goal, avoiding splitting/collapsing THPs for swap out/in, is very
important too.
> However, swap-in/out bandwidth enhance is common requirement for both
> normal and THP page and with Tim's work, we could enhance swapout path.
>
> So, I think you should give us to number about how THP split is bad
> for the swapout bandwidth even though we applied Tim's work.
> If it's serious, next approach is yours that we could tweak swap code
> be aware of a THP to avoid splitting a THP.
It's not only about the CPU cycles spent in splitting and collapsing
THPs, but also about how to make THP work effectively on systems with
swap turned on.
To avoid disturbing user applications etc., THP collapsing doesn't work
aggressively to collapse anonymous pages into THPs.  This means that,
once a THP is split, it will take quite a long time (wall time, not CPU
cycles) for it to be collapsed into a THP again, especially on machines
with a large memory size.  And on systems with swap turned on, THPs are
currently split during swap out/in.  If much swapping out/in is
triggered while the system is running, it is possible that many THPs
are split and have no chance to be collapsed.  Even if a THP that has
been split gets the opportunity to be collapsed again, the applications
lose the opportunity to take advantage of the THP for quite a long time
too.  And the memory will be fragmented during the process, which makes
it hard to allocate new THPs.  The end result is that THP usage is very
low in this situation.  One solution is to avoid splitting/collapsing
THPs during swap out/in.
> For THP swap-in, I think it's another topic we should discuss.
> For each step, it's orthogonal work so it shouldn't rely on next goal.
>
>
>> patchset?  Lacks the opportunity to allocate consecutive 512 swap slots
>> in 2 non-free swap clusters?  I don't think that is a regression,
>> because the patchset will NOT make free swap clusters consumed faster
>> than that in current code.  Even if it were better to allocate
>> consecutive 512 swap slots in 2 non-free swap clusters, it could be an
>> incremental improvement to the simple solution in this patchset.  That
>> is, to allocate 512 swap slots, the simple solution is:
>> 
>> a) Try to allocate a free swap cluster
>> b) If a) fails, give up
>> 
>> The improved solution could be (if it were needed finally)
>> 
>> a) Try to allocate a free swap cluster
>> b) If a) fails, try to allocate consecutive 512 swap slots in 2 non-free
>>    swap clusters
>> c) If b) fails, give up
>
> I didn't mean it. Please read above.
>
>> 
>> > If it's hard, send full patchset all at once so reviewers can think
>> > what you want of right direction and implementation is good for it.
>> 
>> Thanks for suggestion.
>
> Huang,
>
> I'm sorry if I misunderstand something. And I should admit I'm not a THP
> user even so I'm blind on a THP workload so sorry too if I miss really
> something clear. However, my concern is adding more complexity to swap
> layer without justification and to me, it's really hard to understand your
> motivation from your description.
>
> If you want step by step approach, for the first step, please prove
> how THP split is bad in swapout path and it would be better to consider
> how to make codes shareable with normal pages batching so THP awareness
> on top of normal page batching, it would be more easy to prove/review,
> I think.
If normal page batching needs them, I think the free swap cluster
allocating/freeing functions in this patchset could be reused for it.
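
As a rough illustration of the two strategies quoted above (a/b versus
a/b/c), here is a minimal user-space sketch.  It is only a model:
slot_used[], NO_SLOT and all the function names are hypothetical and do
not correspond to the swapfile.c code in this patchset, and the
kernel's convention of returning entry.val == 0 on failure is replaced
by NO_SLOT here.

```c
#include <assert.h>

#define CLUSTER_SLOTS	512u	/* slots per cluster (HPAGE_PMD_NR on x86_64) */
#define NR_CLUSTERS	8u	/* size of the model device, chosen arbitrarily */
#define NR_SLOTS	(CLUSTER_SLOTS * NR_CLUSTERS)
#define NO_SLOT		((unsigned int)-1)	/* model's "allocation failed" */

static unsigned char slot_used[NR_SLOTS];	/* 1 if the swap slot is in use */

static void mark_used(unsigned int start, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		slot_used[start + i] = 1;
}

/* a) Claim a completely free cluster; return its first slot or NO_SLOT. */
static unsigned int alloc_free_cluster(void)
{
	for (unsigned int c = 0; c < NR_CLUSTERS; c++) {
		unsigned int base = c * CLUSTER_SLOTS, i;

		for (i = 0; i < CLUSTER_SLOTS && !slot_used[base + i]; i++)
			;
		if (i == CLUSTER_SLOTS) {
			mark_used(base, CLUSTER_SLOTS);
			return base;
		}
	}
	return NO_SLOT;
}

/* b) Claim any run of CLUSTER_SLOTS free slots, even across a boundary. */
static unsigned int alloc_consecutive(void)
{
	unsigned int run = 0;

	for (unsigned int i = 0; i < NR_SLOTS; i++) {
		run = slot_used[i] ? 0 : run + 1;
		if (run == CLUSTER_SLOTS) {
			unsigned int start = i + 1 - CLUSTER_SLOTS;

			mark_used(start, CLUSTER_SLOTS);
			return start;
		}
	}
	return NO_SLOT;
}

/* Simple solution: try a), then give up. */
static unsigned int alloc_huge_simple(void)
{
	return alloc_free_cluster();
}

/* Improved solution: try a), then b), then give up. */
static unsigned int alloc_huge_improved(void)
{
	unsigned int start = alloc_free_cluster();

	return start != NO_SLOT ? start : alloc_consecutive();
}
```

The two variants differ only when no cluster is completely free: the
simple one fails right away, while the improved one can still succeed
if 512 free slots happen to straddle two partly used clusters.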
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-20  2:54                   ` Huang, Ying
@ 2016-09-20  5:06                     ` Minchan Kim
  2016-09-20  5:28                       ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Minchan Kim @ 2016-09-20  5:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
Hi Huang,
On Tue, Sep 20, 2016 at 10:54:35AM +0800, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim <minchan@kernel.org> writes:
> > Hi Huang,
> >
> > On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim <minchan@kernel.org> writes:
> >> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
> >> >> >> Minchan Kim <minchan@kernel.org> writes:
> >> >> >> 
> >> >> >> > Hi Huang,
> >> >> >> >
> >> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
> >> >> >> >
> 
> [snip]
> 
> >> >> > 1. If we solve batching swapout, then how is THP split for swapout bad?
> >> >> > 2. Also, how is current conservative swapin from khugepaged bad?
> >> >> >
> >> >> > I think it's one of decision point for the motivation of your work
> >> >> > and for 1, we need batching swapout feature.
> >> >> >
> >> >> > I am saying again that I'm not against your goal but only concern
> >> >> > is approach. If you don't agree, please ignore me.
> >> >> 
> >> >> I am glad to discuss my final goal, that is, swapping out/in the full
> >> >> THP without splitting.  Why I want to do that is copied as below,
> >> >
> >> > Yes, it's your *final* goal but what if it couldn't be acceptable
> >> > on second step you mentioned above, for example?
> >> >
> >> >         Unnecessary binded implementation to rejected work.
> >> 
> >> So I want to discuss my final goal.  If people accept my final goal,
> >> this is resolved.  If people don't accept, I will reconsider it.
> >
> > No.
> >
> > Please keep it in mind. There are lots of factors the project would
> > be broken during going on by several reasons because we are human being
> > so we can simply miss something clear and realize it later that it's
> > not feasible. Otherwise, others can show up with better idea for the
> > goal or fix other subsystem which can affect your goals.
> > I don't want to say such boring theoretical stuffs any more.
> >
> > My point is patchset should be self-contained if you really want to go
> > with step-by-step approach because we are likely to miss something
> > *easily*.
> >
> >> 
> >> > If you want to achieve your goal step by step, please consider if
> >> > one of step you are thinking could be rejected but steps already
> >> > merged should be self-contained without side-effect.
> >> 
> >> What is the side-effect or possible regressions of the step 1 as in this
> >
> > Adding code complexity for unproved feature.
> >
> > When I read your steps, your *most important* goal is to avoid split/
> > collapsing anon THP page for swap out/in. As a bonus with the approach,
> > we could increase swapout/in bandwidth, too. Do I understand correctly?
> 
> It's hard to say which goal is the *most important*.  But it is clear
> that improving swapout/in performance isn't the only goal.  The other
> goal, avoiding splitting/collapsing THPs for swap out/in, is very
> important too.
Okay, then, couldn't you focus on one goal per patchset? After solving
one problem, move on to the next one. What's the problem with that?
One of your goals is swapout performance, and it's the same as Tim's work.
That's why I wanted your patchset to be based on Tim's work. But if you
want your patch to go first, please make the patchset independent of your
other goal so everyone can review it easily and focus on *a* problem.
In your patchset, the THP split delaying part could be folded into your
second patchset, which is to avoid THP split/collapsing.
> 
> > However, swap-in/out bandwidth enhance is common requirement for both
> > normal and THP page and with Tim's work, we could enhance swapout path.
> >
> > So, I think you should give us to number about how THP split is bad
> > for the swapout bandwidth even though we applied Tim's work.
> > If it's serious, next approach is yours that we could tweak swap code
> > be aware of a THP to avoid splitting a THP.
> 
> It's not only about the CPU cycles spent in splitting and collapsing
> THPs, but also about how to make THP work effectively on systems with
> swap turned on.
> 
> To avoid disturbing user applications etc., THP collapsing doesn't work
> aggressively to collapse anonymous pages into THPs.  This means that,
> once a THP is split, it will take quite a long time (wall time, not CPU
> cycles) for it to be collapsed into a THP again, especially on machines
> with a large memory size.  And on systems with swap turned on, THPs are
> currently split during swap out/in.  If much swapping out/in is
> triggered while the system is running, it is possible that many THPs
> are split and have no chance to be collapsed.  Even if a THP that has
> been split gets the opportunity to be collapsed again, the applications
> lose the opportunity to take advantage of the THP for quite a long time
> too.  And the memory will be fragmented during the process, which makes
> it hard to allocate new THPs.  The end result is that THP usage is very
> low in this situation.  One solution is to avoid splitting/collapsing
> THPs during swap out/in.
I understand what you want. I have a few questions about the goal but
will not ask them now because I want to see more in your description to
understand the current situation well.
Huang, please don't mix your goals in one patchset, and back your
claims with numbers that justify them. That would make more reviewers happy.
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-20  5:06                     ` Minchan Kim
@ 2016-09-20  5:28                       ` Huang, Ying
  0 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-20  5:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Minchan Kim <minchan@kernel.org> writes:
> Hi Huang,
>
> On Tue, Sep 20, 2016 at 10:54:35AM +0800, Huang, Ying wrote:
>> Hi, Minchan,
>> 
>> Minchan Kim <minchan@kernel.org> writes:
>> > Hi Huang,
>> >
>> > On Sun, Sep 18, 2016 at 09:53:39AM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > On Tue, Sep 13, 2016 at 04:53:49PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >> > On Tue, Sep 13, 2016 at 02:40:00PM +0800, Huang, Ying wrote:
>> >> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >> >> 
>> >> >> >> > Hi Huang,
>> >> >> >> >
>> >> >> >> > On Fri, Sep 09, 2016 at 01:35:12PM -0700, Huang, Ying wrote:
>> >> >> >> >
>> 
>> [snip]
>> 
>> >> >> > 1. If we solve batching swapout, then how is THP split for swapout bad?
>> >> >> > 2. Also, how is current conservative swapin from khugepaged bad?
>> >> >> >
>> >> >> > I think it's one of decision point for the motivation of your work
>> >> >> > and for 1, we need batching swapout feature.
>> >> >> >
>> >> >> > I am saying again that I'm not against your goal but only concern
>> >> >> > is approach. If you don't agree, please ignore me.
>> >> >> 
>> >> >> I am glad to discuss my final goal, that is, swapping out/in the full
>> >> >> THP without splitting.  Why I want to do that is copied as below,
>> >> >
>> >> > Yes, it's your *final* goal but what if it couldn't be acceptable
>> >> > on second step you mentioned above, for example?
>> >> >
>> >> >         Unnecessary implementation bound to rejected work.
>> >> 
>> >> So I want to discuss my final goal.  If people accept my final goal,
>> >> this is resolved.  If people don't accept, I will reconsider it.
>> >
>> > No.
>> >
>> > Please keep it in mind. There are lots of factors the project would
>> > be broken during going on by several reasons because we are human being
>> > so we can simply miss something clear and realize it later that it's
>> > not feasible. Otherwise, others can show up with better idea for the
>> > goal or fix other subsystem which can affect your goals.
>> > I don't want to say such boring theoretical stuffs any more.
>> >
>> > My point is patchset should be self-contained if you really want to go
>> > with step-by-step approach because we are likely to miss something
>> > *easily*.
>> >
>> >> 
>> >> > If you want to achieve your goal step by step, please consider if
>> >> > one of step you are thinking could be rejected but steps already
>> >> > merged should be self-contained without side-effect.
>> >> 
>> >> What is the side-effect or possible regressions of the step 1 as in this
>> >
>> > Adding code complexity for unproved feature.
>> >
>> > When I read your steps, your *most important* goal is to avoid split/
>> > collapsing anon THP page for swap out/in. As a bonus with the approach,
>> > we could increase swapout/in bandwidth, too. Do I understand correctly?
>> 
>> It's hard to say what is the *most important* goal.  But it is clear
>> that to improve swapout/in performance isn't the only goal.  The other
>> goal to avoid split/collapsing THP page for swap out/in is very
>> important too.
>
> Okay, then, couldn't you focus a goal in patchset? After solving a problem,
> then next one. What's the problem?
> One of your goal is swapout performance and it's same with Tim's work.
> That's why I wanted to make your patchset based on Tim's work. But if you
> want your patch first, please make patchset independent with your other goal
> so everyone can review easily and focus on *a* problem.
> In your patchset, THP split delaying part could be folded into in your second
> patchset which is to avoid THP split/collapsing.
I thought having multiple goals in one patchset was common.  But if you want
just one goal for review, I suggest you review the patchset against the
goal of avoiding split/collapse of anon THP pages for swap out/in.  This
patchset is just the first step towards that.
>> > However, swap-in/out bandwidth enhance is common requirement for both
>> > normal and THP page and with Tim's work, we could enhance swapout path.
>> >
>> > So, I think you should give us to number about how THP split is bad
>> > for the swapout bandwidth even though we applied Tim's work.
>> > If it's serious, next approach is yours that we could tweak swap code
>> > be aware of a THP to avoid splitting a THP.
>> 
>> It's not only about CPU cycles spent in splitting and collapsing THP,
>> but also how to make THP work effectively on systems with swap turned
>> on.
>> 
>> To avoid disturbing user applications etc., THP collapsing doesn't work
>> aggressively to collapse anonymous pages into THP.  This means, once the
>> THP is split, it will take quite long time (wall time, instead of CPU
>> cycles) to be collapsed to become a THP, especially on machines with
>> large memory size.  And on systems with swap turned on, THP will be
>> split during swap out/in now.  If much swapping out/in is triggered
>> during system running, it is possible that many THP is split, and have
>> no chance to be collapsed.  Even if the THP that has been split gets
>> opportunity to be collapsed again, the applications lose the opportunity
>> to take advantage of the THP for quite long time too.  And the memory
>> will be fragmented during the process, this makes it hard to allocate
>> new THP.  The end result is that THP usage is very low in this
>> situation.  One solution is to avoid to split/collapse THP during swap
>> out/in.
>
> I understand what you want. I have a few questions for the goal but
> will not ask now because I want to see more in your description to
> understand current situation well.
>
> Huang, please, don't mix your goals in a patchset and include your
> claim with number we can justify. It would make more reviewer happy.
Best Regards,
Huang, Ying
* Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
  2016-09-20  2:01       ` Huang, Ying
@ 2016-09-22 19:25         ` Johannes Weiner
  0 siblings, 0 replies; 60+ messages in thread
From: Johannes Weiner @ 2016-09-22 19:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Anshuman Khandual, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, linux-mm, linux-kernel, Hugh Dickins,
	Shaohua Li, Minchan Kim, Rik van Riel
Hi Ying,
On Tue, Sep 20, 2016 at 10:01:30AM +0800, Huang, Ying wrote:
> It appears all patches other than [10/10] in the series is used by the
> last patch [10/10], directly or indirectly.  And Without [10/10], they
> don't make much sense.  So you suggest me to use one large patch?
> Something like below?  Does that help you to review?
I find this version a lot easier to review, thank you.
> As the first step, in this patch, the splitting huge page is
> delayed from almost the first step of swapping out to after allocating
> the swap space for the THP and adding the THP into the swap cache.
> This will reduce lock acquiring/releasing for the locks used for the
> swap cache management.
I agree that that's a fine goal for this patch series. We can worry
about 2MB IO submissions later on.
> @@ -503,6 +503,19 @@ config FRONTSWAP
>  
>  	  If unsure, say Y to enable frontswap.
>  
> +config ARCH_USES_THP_SWAP_CLUSTER
> +	bool
> +	default n
> +
> +config THP_SWAP_CLUSTER
> +	bool
> +	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
> +	default y
> +	help
> +	  Use one swap cluster to hold the contents of the THP
> +	  (Transparent Huge Page) swapped out.  The size of the swap
> +	  cluster will be same as that of THP.
Making swap space allocation and swapcache handling THP-native is not
dependent on the architecture, it's generic VM code. Can you please
just define the cluster size depending on CONFIG_TRANSPARENT_HUGEPAGE?
> @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct
>  	}
>  }
>  
> +#ifdef CONFIG_THP_SWAP_CLUSTER
> +#define SWAPFILE_CLUSTER	(HPAGE_SIZE / PAGE_SIZE)
> +#else
>  #define SWAPFILE_CLUSTER	256
> +#endif
>  #define LATENCY_LIMIT		256
I.e. this?
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
#else
#define SWAPFILE_CLUSTER	256
#endif
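For reference, a small userspace sketch of what that cluster size works out
to. It assumes the x86_64 values (4KB base pages, 2MB PMD-mapped huge pages);
the DEMO_ names are stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>

/* Assumed x86_64 values; stand-ins for PAGE_SHIFT and HPAGE_PMD_SHIFT. */
#define DEMO_PAGE_SHIFT		12	/* 4KB base page */
#define DEMO_HPAGE_PMD_SHIFT	21	/* 2MB PMD-mapped huge page */

/* Base pages per huge page, i.e. what HPAGE_PMD_NR evaluates to. */
static unsigned long demo_hpage_pmd_nr(void)
{
	assert(DEMO_HPAGE_PMD_SHIFT > DEMO_PAGE_SHIFT);
	return 1UL << (DEMO_HPAGE_PMD_SHIFT - DEMO_PAGE_SHIFT);
}

/* Cluster size in swap slots: THP-sized with THP configured, else 256. */
static unsigned long demo_swapfile_cluster(int thp_enabled)
{
	return thp_enabled ? demo_hpage_pmd_nr() : 256;
}
```

With these assumptions one cluster holds exactly one THP (512 slots), twice
the non-THP default of 256.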
> @@ -18,6 +18,13 @@ struct swap_cgroup {
>  };
>  #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
>  
> +struct swap_cgroup_iter {
> +	struct swap_cgroup_ctrl *ctrl;
> +	struct swap_cgroup *sc;
> +	swp_entry_t entry;
> +	unsigned long flags;
> +};
> +
>  /*
>   * SwapCgroup implements "lookup" and "exchange" operations.
>   * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> @@ -75,6 +82,35 @@ static struct swap_cgroup *lookup_swap_c
>  	return sc + offset % SC_PER_PAGE;
>  }
>  
> +static void swap_cgroup_iter_init(struct swap_cgroup_iter *iter,
> +				  swp_entry_t ent)
> +{
> +	iter->entry = ent;
> +	iter->sc = lookup_swap_cgroup(ent, &iter->ctrl);
> +	spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
> +}
> +
> +static void swap_cgroup_iter_exit(struct swap_cgroup_iter *iter)
> +{
> +	spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
> +}
> +
> +/*
> + * swap_cgroup is stored in a kind of discontinuous array.  That is,
> + * they are continuous in one page, but not across page boundary.  And
> + * there is one lock for each page.
> + */
> +static void swap_cgroup_iter_advance(struct swap_cgroup_iter *iter)
> +{
> +	iter->sc++;
> +	iter->entry.val++;
> +	if (!(((unsigned long)iter->sc) & ~PAGE_MASK)) {
> +		spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
> +		iter->sc = lookup_swap_cgroup(iter->entry, &iter->ctrl);
> +		spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
> +	}
> +}
> +
>  /**
>   * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
>   * @ent: swap entry to be cmpxchged
> @@ -87,45 +123,49 @@ static struct swap_cgroup *lookup_swap_c
>  unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>  					unsigned short old, unsigned short new)
>  {
> -	struct swap_cgroup_ctrl *ctrl;
> -	struct swap_cgroup *sc;
> -	unsigned long flags;
> +	struct swap_cgroup_iter iter;
>  	unsigned short retval;
>  
> -	sc = lookup_swap_cgroup(ent, &ctrl);
> +	swap_cgroup_iter_init(&iter, ent);
>  
> -	spin_lock_irqsave(&ctrl->lock, flags);
> -	retval = sc->id;
> +	retval = iter.sc->id;
>  	if (retval == old)
> -		sc->id = new;
> +		iter.sc->id = new;
>  	else
>  		retval = 0;
> -	spin_unlock_irqrestore(&ctrl->lock, flags);
> +
> +	swap_cgroup_iter_exit(&iter);
>  	return retval;
>  }
>  
>  /**
> - * swap_cgroup_record - record mem_cgroup for this swp_entry.
> - * @ent: swap entry to be recorded into
> + * swap_cgroup_record - record mem_cgroup for a set of swap entries
> + * @ent: the first swap entry to be recorded into
>   * @id: mem_cgroup to be recorded
> + * @nr_ents: number of swap entries to be recorded
>   *
>   * Returns old value at success, 0 at failure.
>   * (Of course, old value can be 0.)
>   */
> -unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
> +unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> +				  unsigned int nr_ents)
>  {
> -	struct swap_cgroup_ctrl *ctrl;
> -	struct swap_cgroup *sc;
> +	struct swap_cgroup_iter iter;
>  	unsigned short old;
> -	unsigned long flags;
>  
> -	sc = lookup_swap_cgroup(ent, &ctrl);
> +	swap_cgroup_iter_init(&iter, ent);
>  
> -	spin_lock_irqsave(&ctrl->lock, flags);
> -	old = sc->id;
> -	sc->id = id;
> -	spin_unlock_irqrestore(&ctrl->lock, flags);
> +	old = iter.sc->id;
> +	for (;;) {
> +		VM_BUG_ON(iter.sc->id != old);
> +		iter.sc->id = id;
> +		nr_ents--;
> +		if (!nr_ents)
> +			break;
> +		swap_cgroup_iter_advance(&iter);
> +	}
>  
> +	swap_cgroup_iter_exit(&iter);
>  	return old;
>  }
The iterator seems overkill for one real user, and it's undesirable in
the single-slot access from swap_cgroup_cmpxchg(). How about something
like the following?
static struct swap_cgroup *lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl,
					      pgoff_t offset)
{
	struct swap_cgroup *sc;
	/* Each page of the map holds SC_PER_PAGE entries. */
	sc = page_address(ctrl->map[offset / SC_PER_PAGE]);
	return sc + (offset % SC_PER_PAGE);
}
unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
					unsigned short old, unsigned short new)
{
	struct swap_cgroup_ctrl *ctrl;
	struct swap_cgroup *sc;
	unsigned long flags;
	unsigned short retval;
	pgoff_t off = swp_offset(ent);
	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
	sc = lookup_swap_cgroup(ctrl, off);
	spin_lock_irqsave(&ctrl->lock, flags);
	retval = sc->id;
	if (retval == old)
		sc->id = new;
	else
		retval = 0;
	spin_unlock_irqrestore(&ctrl->lock, flags);
	return retval;
}
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
				  unsigned int nr_entries)
{
	struct swap_cgroup_ctrl *ctrl;
	struct swap_cgroup *sc;
	unsigned short old;
	unsigned long flags;
	pgoff_t offset = swp_offset(ent);
	pgoff_t end = offset + nr_entries;
	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
	sc = lookup_swap_cgroup(ctrl, offset);
	spin_lock_irqsave(&ctrl->lock, flags);
	old = sc->id;
	while (offset != end) {
		sc->id = id;
		offset++;
		if (offset % SC_PER_PAGE)
			sc++;
		else
			sc = lookup_swap_cgroup(ctrl, offset);
	}
	spin_unlock_irqrestore(&ctrl->lock, flags);
	return old;
}
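The page-boundary handling in that loop can be tried out in userspace. The
following is only a toy model under stated assumptions: the demo_ names are
made up, the "pages" are tiny arrays of four entries so that a short range
crosses a boundary, and locking is left out entirely. Unlike the loop above,
it stops before looking up the entry one past the end of the range.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SC_PER_PAGE 4	/* tiny on purpose so short ranges cross pages */
#define DEMO_NR_PAGES	 3

struct demo_swap_cgroup { unsigned short id; };

/* Each row models one map page: entries are contiguous only within a row. */
static struct demo_swap_cgroup demo_pages[DEMO_NR_PAGES][DEMO_SC_PER_PAGE];

static struct demo_swap_cgroup *demo_lookup(size_t offset)
{
	assert(offset < DEMO_NR_PAGES * DEMO_SC_PER_PAGE);
	return &demo_pages[offset / DEMO_SC_PER_PAGE][offset % DEMO_SC_PER_PAGE];
}

/* Record id for nr consecutive offsets; returns the old id of the first. */
static unsigned short demo_record(size_t offset, unsigned short id, size_t nr)
{
	size_t end = offset + nr;
	struct demo_swap_cgroup *sc = demo_lookup(offset);
	unsigned short old = sc->id;

	while (offset != end) {
		sc->id = id;
		offset++;
		if (offset == end)
			break;			/* do not look up past the range */
		if (offset % DEMO_SC_PER_PAGE)
			sc++;			/* still inside the same page */
		else
			sc = demo_lookup(offset);	/* crossed into the next page */
	}
	return old;
}
```

Calling demo_record(2, 7, 5) on a zeroed map sets offsets 2 through 6, which
span the first and second page, and leaves offsets 1 and 7 untouched.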
> @@ -145,20 +162,66 @@ void __delete_from_swap_cache(struct pag
>  
>  	entry.val = page_private(page);
>  	address_space = swap_address_space(entry);
> -	radix_tree_delete(&address_space->page_tree, page_private(page));
> -	set_page_private(page, 0);
>  	ClearPageSwapCache(page);
> -	address_space->nrpages--;
> -	__dec_node_page_state(page, NR_FILE_PAGES);
> -	INC_CACHE_INFO(del_total);
> +	for (i = 0; i < nr; i++) {
> +		struct page *cur_page = page + i;
> +
> +		radix_tree_delete(&address_space->page_tree,
> +				  page_private(cur_page));
> +		set_page_private(cur_page, 0);
> +	}
> +	address_space->nrpages -= nr;
> +	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
> +	ADD_CACHE_INFO(del_total, nr);
> +}
> +
> +#ifdef CONFIG_THP_SWAP_CLUSTER
> +int add_to_swap_trans_huge(struct page *page, struct list_head *list)
> +{
> +	swp_entry_t entry;
> +	int ret = 0;
> +
> +	/* cannot split, which may be needed during swap in, skip it */
> +	if (!can_split_huge_page(page))
> +		return -EBUSY;
> +	/* fallback to split huge page firstly if no PMD map */
> +	if (!compound_mapcount(page))
> +		return 0;
The can_split_huge_page() (and maybe also the compound_mapcount())
optimizations look like they could be split out into separate followup
patches. They're definitely nice to have, but don't seem necessary to
make this patch minimally complete.
> @@ -168,11 +231,23 @@ int add_to_swap(struct page *page, struc
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
>  
> +	if (unlikely(PageTransHuge(page))) {
> +		err = add_to_swap_trans_huge(page, list);
> +		switch (err) {
> +		case 1:
> +			return 1;
> +		case 0:
> +			/* fallback to split firstly if return 0 */
> +			break;
> +		default:
> +			return 0;
> +		}
> +	}
>  	entry = get_swap_page();
>  	if (!entry.val)
>  		return 0;
>  
> -	if (mem_cgroup_try_charge_swap(page, entry)) {
> +	if (mem_cgroup_try_charge_swap(page, entry, 1)) {
>  		swapcache_free(entry);
>  		return 0;
>  	}
Instead of duplicating the control flow at such a high level -
add_to_swap() and add_to_swap_trans_huge() are basically identical -
it's better to push down the THP handling as low as possible:
Pass the page to get_swap_page(), and then decide in there whether
it's THP and you need to allocate a single entry or a cluster.
And mem_cgroup_try_charge_swap() already gets the page. Again, check
in there how much swap to charge based on the passed page instead of
passing the same information twice.
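A rough userspace sketch of that shape, to make the idea concrete. Everything
here is hypothetical: struct demo_page, struct demo_swap_dev and the demo_
functions are toy stand-ins, not the kernel's struct page or swap allocator,
and the 512 is the assumed x86_64 HPAGE_PMD_NR.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_HPAGE_PMD_NR 512	/* assumed: 2MB THP over 4KB base pages */

/* Hypothetical stand-in for struct page; only knows if it is a THP. */
struct demo_page { bool trans_huge; };

/* Toy swap device: hands out slots in increasing order. */
struct demo_swap_dev { unsigned long next_free; unsigned long nr_slots; };

/* The one place that answers "how many swap slots does this page need?" */
static unsigned int demo_page_nr_slots(const struct demo_page *page)
{
	assert(page);
	return page->trans_huge ? DEMO_HPAGE_PMD_NR : 1;
}

/* Returns the first slot of the allocation, or 0 on failure. */
static unsigned long demo_get_swap_page(struct demo_swap_dev *dev,
					const struct demo_page *page)
{
	unsigned int nr = demo_page_nr_slots(page);
	/* Round up so that a THP always gets a whole, aligned cluster. */
	unsigned long first = (dev->next_free + nr - 1) / nr * nr;

	if (first + nr > dev->nr_slots)
		return 0;
	dev->next_free = first + nr;
	return first;
}

/* Charging derives the amount from the page too, not from an extra argument. */
static void demo_charge_swap(unsigned long *charged,
			     const struct demo_page *page)
{
	*charged += demo_page_nr_slots(page);
}
```

The caller passes only the page in both cases, so the slot count is derived
in one place and cannot disagree between allocation and charging.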
Doing that will change the structure of the patch too much to review
the paths below in their current form. I'll have a closer look in the
next version.
Thanks!
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (11 preceding siblings ...)
  2016-09-19 17:33 ` Hugh Dickins
@ 2016-09-22 22:56 ` Shaohua Li
  2016-09-22 23:49   ` Chen, Tim C
                     ` (2 more replies)
  12 siblings, 3 replies; 60+ messages in thread
From: Shaohua Li @ 2016-09-22 22:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> 
> The advantages of the THP swap support include:
> 
> - Batch the swap operations for the THP to reduce lock
>   acquiring/releasing, including allocating/freeing the swap space,
>   adding/deleting to/from the swap cache, and writing/reading the swap
>   space, etc.  This will help improve the performance of the THP swap.
> 
> - The THP swap space read/write will be 2M sequential IO.  It is
>   particularly helpful for the swap read, which usually are 4k random
>   IO.  This will improve the performance of the THP swap too.
I don't think this is a problem. Even with the current early split, we allocate
swap entries sequentially, and after the IO is dispatched the block layer will
merge it into large requests.
> - It will help the memory fragmentation, especially when the THP is
>   heavily used by the applications.  The 2M continuous pages will be
>   free up after THP swapping out.
So this is impossible without THP swapin. While 2M swapout makes a lot of
sense, I doubt 2M swapin is really useful. What kind of application is
'optimized' to do sequential memory access?
One advantage of THP swapout is fewer TLB flushes. E.g., when we split 2M into 4k
pages, we could set the swap entries for the 4k pages during the split, since your
patch already allocates the swap entries before the split, so we would do only one
TLB flush, in the split. Without the delayed THP split, we do two TLB flushes (one for
the split and one for the unmap of swapout). I don't see this in the patches; did I
misread the code?
Thanks,
Shaohua
* RE: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-22 22:56 ` Shaohua Li
@ 2016-09-22 23:49   ` Chen, Tim C
  2016-09-22 23:53     ` Andi Kleen
  2016-09-23  0:38   ` Rik van Riel
  2016-09-23  2:12   ` Huang, Ying
  2 siblings, 1 reply; 60+ messages in thread
From: Chen, Tim C @ 2016-09-22 23:49 UTC (permalink / raw)
  To: Shaohua Li, Huang, Ying
  Cc: Andrew Morton, Hansen, Dave, Kleen, Andi, Lu, Aaron,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
>
>So this is impossible without THP swapin. While 2M swapout makes a lot of
>sense, I doubt 2M swapin is really useful. What kind of application is 'optimized'
>to do sequential memory access?
We waste a lot of cpu cycles to re-compact 4K pages back to a large page
under THP.  Swapping it back in as a single large page can avoid
fragmentation and this overhead.
Thanks.
Tim
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-22 23:49   ` Chen, Tim C
@ 2016-09-22 23:53     ` Andi Kleen
  0 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2016-09-22 23:53 UTC (permalink / raw)
  To: Chen, Tim C
  Cc: Shaohua Li, Huang, Ying, Andrew Morton, Hansen, Dave, Kleen, Andi,
	Lu, Aaron, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins, Minchan Kim, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov, Vladimir Davydov, Johannes Weiner,
	Michal Hocko
"Chen, Tim C" <tim.c.chen@intel.com> writes:
>>
>>So this is impossible without THP swapin. While 2M swapout makes a lot of
>>sense, I doubt 2M swapin is really useful. What kind of application is 'optimized'
>>to do sequential memory access?
Anything that touches regions larger than 4K, where we want the
kernel to do minimal work to manage the swapping.
>
> We waste a lot of cpu cycles to re-compact 4K pages back to a large page
> under THP.  Swapping it back in as a single large page can avoid
> fragmentation and this overhead.
Also splitting something just to merge it again is wasteful.
A lot of big improvements in the block and VM and network layers
over the years came from avoiding that kind of wasteful work.
-Andi
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-22 22:56 ` Shaohua Li
  2016-09-22 23:49   ` Chen, Tim C
@ 2016-09-23  0:38   ` Rik van Riel
  2016-09-23  2:32     ` Huang, Ying
  2016-09-23  2:12   ` Huang, Ying
  2 siblings, 1 reply; 60+ messages in thread
From: Rik van Riel @ 2016-09-23  0:38 UTC (permalink / raw)
  To: Shaohua Li, Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> > 
> > - It will help the memory fragmentation, especially when the THP is
> >   heavily used by the applications.  The 2M continuous pages will
> > be
> >   free up after THP swapping out.
> 
> So this is impossible without THP swapin. While 2M swapout makes a
> lot of
> sense, I doubt 2M swapin is really useful. What kind of application
> is
> 'optimized' to do sequential memory access?
I suspect a lot of this will depend on the ratio of storage
speed to CPU & RAM speed.
When swapping to a spinning disk, it makes sense to avoid
extra memory use on swapin, and work in 4kB blocks.
When swapping to NVRAM, it makes sense to use 2MB blocks,
because that storage can handle data faster than we can
manage 4kB pages in the VM.
-- 
All Rights Reversed.
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-22 22:56 ` Shaohua Li
  2016-09-22 23:49   ` Chen, Tim C
  2016-09-23  0:38   ` Rik van Riel
@ 2016-09-23  2:12   ` Huang, Ying
  2 siblings, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-23  2:12 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Hi, Shaohua,
Thanks for comments!
Shaohua Li <shli@kernel.org> writes:
> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> 
>> The advantages of the THP swap support include:
Sorry for the confusion.  These are the advantages of the final goal, that is,
avoiding splitting/collapsing the THP during swap out/in, not the
advantages of this patchset.  This patchset is just the first step towards
the final goal, so some advantages of the final goal are not reflected
in this patchset.
>> - Batch the swap operations for the THP to reduce lock
>>   acquiring/releasing, including allocating/freeing the swap space,
>>   adding/deleting to/from the swap cache, and writing/reading the swap
>>   space, etc.  This will help improve the performance of the THP swap.
>> 
>> - The THP swap space read/write will be 2M sequential IO.  It is
>>   particularly helpful for the swap read, which usually are 4k random
>>   IO.  This will improve the performance of the THP swap too.
>
> I think this is not a problem. Even with current early split, we are allocating
> swap entry sequentially, after IO is dispatched, block layer will merge IO to
> big size.
Yes.  For swap out, the original implementation can already merge IO into
large requests.  But for THP swap out, instead of allocating one bio for
each 4k page in a THP, we can allocate one bio for each THP.  This will
avoid wasting many CPU cycles on splitting and then merging.  I think this
will help performance on fast storage devices.
>> - It will help the memory fragmentation, especially when the THP is
>>   heavily used by the applications.  The 2M continuous pages will be
>>   free up after THP swapping out.
>
> So this is impossible without THP swapin. While 2M swapout makes a lot of
> sense, I doubt 2M swapin is really useful. What kind of application is
> 'optimized' to do sequential memory access?
Although applications usually don't do much sequential memory access,
they still have spatial locality.  And with 2M swap in, a page that was a
THP before being swapped out remains a THP after being swapped in.  It can
be mapped with a PMD in the application's page table.  This will help
reduce TLB misses.
> One advantage of THP swapout is to reduce TLB flush. Eg, when we split 2m to 4k
> pages, we set swap entry for the 4k pages since your patch already allocates
> swap entry before the split, so we only do tlb flush once in the split. Without
> the delay THP split, we do twice tlb flush (split and unmap of swapout). I
> don't see this in the patches, do I misread the code?
Combining THP splitting with unmapping?  That sounds like a good idea.
It is not implemented in this patchset because I had not thought of
it before :).
In the next step of THP swap support, I will delay THP splitting further,
until after swapping out has finished.  At that point, we will avoid calling
split_huge_page_to_list() during swapping out, so the TLB flush only
needs to be done once, for the unmap.
Best Regards,
Huang, Ying
> Thanks,
> Shaohua
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-23  0:38   ` Rik van Riel
@ 2016-09-23  2:32     ` Huang, Ying
  2016-09-25 19:18       ` Shaohua Li
  0 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2016-09-23  2:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Shaohua Li, Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, linux-mm, linux-kernel, Hugh Dickins,
	Minchan Kim, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Rik van Riel <riel@redhat.com> writes:
> On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
>> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> > 
>> > - It will help the memory fragmentation, especially when the THP is
>> >   heavily used by the applications.  The 2M continuous pages will
>> > be
>> >   free up after THP swapping out.
>> 
>> So this is impossible without THP swapin. While 2M swapout makes a
>> lot of
>> sense, I doubt 2M swapin is really useful. What kind of application
>> is
>> 'optimized' to do sequential memory access?
>
> I suspect a lot of this will depend on the ratio of storage
> speed to CPU & RAM speed.
>
> When swapping to a spinning disk, it makes sense to avoid
> extra memory use on swapin, and work in 4kB blocks.
For spinning disks, the THP swap optimization is turned off in the
current implementation, because huge swap cluster allocation is based on
swap cluster management, which is available only for non-rotating block
devices (blk_queue_nonrot()).
> When swapping to NVRAM, it makes sense to use 2MB blocks,
> because that storage can handle data faster than we can
> manage 4kB pages in the VM.
Best Regards,
Huang, Ying
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-23  2:32     ` Huang, Ying
@ 2016-09-25 19:18       ` Shaohua Li
  2016-09-26  1:06         ` Minchan Kim
  2016-09-26  3:25         ` Huang, Ying
  0 siblings, 2 replies; 60+ messages in thread
From: Shaohua Li @ 2016-09-25 19:18 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Rik van Riel, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko
On Fri, Sep 23, 2016 at 10:32:39AM +0800, Huang, Ying wrote:
> Rik van Riel <riel@redhat.com> writes:
> 
> > On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
> >> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> >> > 
> >> > - It will help the memory fragmentation, especially when the THP is
> >> >   heavily used by the applications.  The 2M continuous pages will
> >> > be
> >> >   free up after THP swapping out.
> >> 
> >> So this is impossible without THP swapin. While 2M swapout makes a
> >> lot of
> >> sense, I doubt 2M swapin is really useful. What kind of application
> >> is
> >> 'optimized' to do sequential memory access?
> >
> > I suspect a lot of this will depend on the ratio of storage
> > speed to CPU & RAM speed.
> >
> > When swapping to a spinning disk, it makes sense to avoid
> > extra memory use on swapin, and work in 4kB blocks.
> 
> For spinning disk, the THP swap optimization will be turned off in
> current implementation.  Because huge swap cluster allocation based on
> swap cluster management, which is available only for non-rotating block
> devices (blk_queue_nonrot()).
For 2M swapin, as long as one byte in the 2M is changed, next time we must do a
2M swapout. That is a huge waste of memory and IO bandwidth, and it adds
unnecessary memory pressure. 2M IO will very easily saturate even a very fast SSD
and make IO the bottleneck. Not sure about NVRAM though.
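To put a number on that worst case, here is a small sketch of the arithmetic.
It assumes dirtiness is tracked per swap-out unit (4k page or 2M THP), which
is a simplification; the demo_ names are made up for illustration.

```c
#include <assert.h>

#define DEMO_SZ_4K (4UL << 10)
#define DEMO_SZ_2M (2UL << 20)

/*
 * Bytes written back to swap when the byte range [first, last] was dirtied
 * and dirtiness is tracked in units of `unit` bytes (a power of two).
 */
static unsigned long demo_writeback_bytes(unsigned long first,
					  unsigned long last,
					  unsigned long unit)
{
	assert(unit && !(unit & (unit - 1)) && first <= last);
	/* Whole units touched by the range, times the unit size. */
	return (last / unit - first / unit + 1) * unit;
}
```

For a single dirtied byte this gives 4096 bytes of writeback at 4k
granularity and 2097152 bytes at 2M granularity, a factor of 512.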
Thanks,
Shaohua
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-25 19:18       ` Shaohua Li
@ 2016-09-26  1:06         ` Minchan Kim
  2016-09-26  3:25         ` Huang, Ying
  1 sibling, 0 replies; 60+ messages in thread
From: Minchan Kim @ 2016-09-26  1:06 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Huang, Ying, Rik van Riel, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, linux-mm, linux-kernel, Hugh Dickins,
	Minchan Kim, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
On Sun, Sep 25, 2016 at 12:18:49PM -0700, Shaohua Li wrote:
> On Fri, Sep 23, 2016 at 10:32:39AM +0800, Huang, Ying wrote:
> > Rik van Riel <riel@redhat.com> writes:
> > 
> > > On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
> > >> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
> > >> > 
> > >> > - It will help the memory fragmentation, especially when the THP is
> > >> >   heavily used by the applications.  The 2M continuous pages will
> > >> > be
> > >> >   free up after THP swapping out.
> > >> 
> > >> So this is impossible without THP swapin. While 2M swapout makes a
> > >> lot of
> > >> sense, I doubt 2M swapin is really useful. What kind of application
> > >> is
> > >> 'optimized' to do sequential memory access?
> > >
> > > I suspect a lot of this will depend on the ratio of storage
> > > speed to CPU & RAM speed.
> > >
> > > When swapping to a spinning disk, it makes sense to avoid
> > > extra memory use on swapin, and work in 4kB blocks.
> > 
> > For spinning disk, the THP swap optimization will be turned off in
> > current implementation.  Because huge swap cluster allocation based on
> > swap cluster management, which is available only for non-rotating block
> > devices (blk_queue_nonrot()).
> 
> For 2m swapin, as long as one byte is changed in the 2m, next time we must do
> 2m swapout. There is huge waste of memory and IO bandwidth and increases
> unnecessary memory pressure. 2M IO will very easily saturate a very fast SSD
I agree. No doubt THP swapout helps overall performance, but we should be
more careful with THP swapin. It would cause memory pressure, which could
evict warm pages and offset THP's benefit. THP swapin would also
increase minor fault latency.
If we want to swap in a THP, I think we need something to guarantee that
the subpages of the swapped-out THP were hot and had temporal locality, so
that it is worth swapping in a whole THP at the cost of other pages kept in
memory.
Maybe it would not matter so much in MADVISE mode, where userspace knows the
pros and cons and chose it. The problem would be in ALWAYS mode.
One idea is to raise the bar for collapsing pages into a THP, for example by
reducing khugepaged_max_ptes_none and introducing khugepaged_max_pte_ref.
With that, khugepaged would collapse 4K pages into a THP only if most of the
subpages are mapped and hot.
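The stricter collapse rule suggested above could look roughly like the
following userspace sketch. It is not the khugepaged implementation. The
struct, the function name should_collapse(), and the field min_ptes_ref are
hypothetical; in particular, min_ptes_ref is only one possible reading of the
proposed khugepaged_max_pte_ref, whose exact semantics are not spelled out in
this thread. HPAGE_PMD_NR is taken as 512, the number of 4K PTEs in a 2M THP
on x86_64.

```c
#include <stdbool.h>

#define HPAGE_PMD_NR 512	/* 4K PTEs per 2M THP on x86_64 */

/* Hypothetical policy knobs, for illustration only. */
struct collapse_policy {
	unsigned int max_ptes_none;	/* max unmapped PTEs tolerated */
	unsigned int min_ptes_ref;	/* assumed: min referenced PTEs required */
};

/*
 * Decide whether a 2M range is worth collapsing, given how many of its
 * PTEs are unmapped (ptes_none) and how many mapped PTEs were recently
 * referenced (ptes_referenced).
 */
bool should_collapse(const struct collapse_policy *p,
		     unsigned int ptes_none, unsigned int ptes_referenced)
{
	/* Reject inconsistent scan results. */
	if (ptes_none > HPAGE_PMD_NR ||
	    ptes_referenced > HPAGE_PMD_NR - ptes_none)
		return false;
	/* Too sparsely mapped: collapsing would mostly allocate new memory. */
	if (ptes_none > p->max_ptes_none)
		return false;
	/* Require enough hot subpages to justify a THP. */
	return ptes_referenced >= p->min_ptes_ref;
}
```

With a permissive policy such as { 511, 0 }, a range with a single mapped
page is collapsed. With a stricter policy such as { 64, 384 }, the same range
is rejected, and only ranges that are mostly mapped and mostly referenced
qualify.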
* Re: [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out
  2016-09-25 19:18       ` Shaohua Li
  2016-09-26  1:06         ` Minchan Kim
@ 2016-09-26  3:25         ` Huang, Ying
  1 sibling, 0 replies; 60+ messages in thread
From: Huang, Ying @ 2016-09-26  3:25 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Huang, Ying, Rik van Riel, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, linux-mm, linux-kernel, Hugh Dickins,
	Minchan Kim, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko
Shaohua Li <shli@kernel.org> writes:
> On Fri, Sep 23, 2016 at 10:32:39AM +0800, Huang, Ying wrote:
>> Rik van Riel <riel@redhat.com> writes:
>> 
>> > On Thu, 2016-09-22 at 15:56 -0700, Shaohua Li wrote:
>> >> On Wed, Sep 07, 2016 at 09:45:59AM -0700, Huang, Ying wrote:
>> >> > 
>> >> > - It will help the memory fragmentation, especially when the THP is
>> >> >   heavily used by the applications.  The 2M continuous pages will
>> >> > be
>> >> >   free up after THP swapping out.
>> >> 
>> >> So this is impossible without THP swapin. While 2M swapout makes a
>> >> lot of
>> >> sense, I doubt 2M swapin is really useful. What kind of application
>> >> is
>> >> 'optimized' to do sequential memory access?
>> >
>> > I suspect a lot of this will depend on the ratio of storage
>> > speed to CPU & RAM speed.
>> >
>> > When swapping to a spinning disk, it makes sense to avoid
>> > extra memory use on swapin, and work in 4kB blocks.
>> 
>> For spinning disk, the THP swap optimization will be turned off in
>> current implementation.  Because huge swap cluster allocation based on
>> swap cluster management, which is available only for non-rotating block
>> devices (blk_queue_nonrot()).
>
> For 2m swapin, as long as one byte is changed in the 2m, next time we must do
> 2m swapout. There is huge waste of memory and IO bandwidth and increases
> unnecessary memory pressure. 2M IO will very easily saturate a very fast SSD
> and makes IO the bottleneck. Not sure about NVRAM though.
One solution is to make 2M swapin configurable, maybe via a sysfs file
in /sys/kernel/mm/transparent_hugepage/, so that we can turn it on only
for really fast storage devices, such as NVRAM.
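As a rough illustration of such a knob, here is a userspace sketch of the
decision it would control. No such sysfs file exists in this patchset.
The functions parse_sysfs_bool() and swapin_unit() and the flag
thp_swapin_enabled are hypothetical. The rotational flag stands for what
/sys/block/<dev>/queue/rotational reports for the swap device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SWAPIN_4K (4096UL)
#define SWAPIN_2M (2UL * 1024 * 1024)

/*
 * Parse the text of a boolean sysfs-style file ("0" or "1", optionally
 * followed by a newline).  Returns 0 on success, -1 on malformed input.
 */
int parse_sysfs_bool(const char *buf, bool *val)
{
	size_t len = strlen(buf);

	if (len > 0 && buf[len - 1] == '\n')
		len--;
	if (len != 1 || (buf[0] != '0' && buf[0] != '1'))
		return -1;
	*val = (buf[0] == '1');
	return 0;
}

/*
 * Hypothetical policy: swap in 2M at a time only when the administrator
 * enabled it and the swap device is non-rotating; otherwise use 4K.
 */
size_t swapin_unit(bool thp_swapin_enabled, bool rotational)
{
	if (rotational || !thp_swapin_enabled)
		return SWAPIN_4K;
	return SWAPIN_2M;
}
```

The default in this sketch is the conservative one: unless the knob is set
and the device is non-rotating, swapin stays at 4K.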
Best Regards,
Huang, Ying
Thread overview: 60+ messages
2016-09-07 16:45 [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
2016-09-08  5:45   ` Anshuman Khandual
2016-09-08 18:07     ` Huang, Ying
2016-09-19 17:09     ` Johannes Weiner
2016-09-20  2:01       ` Huang, Ying
2016-09-22 19:25         ` Johannes Weiner
2016-09-08  8:21   ` Anshuman Khandual
2016-09-08 11:03   ` Kirill A. Shutemov
2016-09-08 17:39     ` Huang, Ying
2016-09-08 11:07   ` Kirill A. Shutemov
2016-09-08 17:23     ` Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 02/10] mm, memcg: Add swap_cgroup_iter iterator Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 03/10] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
2016-09-08  5:46   ` Anshuman Khandual
2016-09-08  8:28   ` Anshuman Khandual
2016-09-08 18:15     ` Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 04/10] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
2016-09-08  5:49   ` Anshuman Khandual
2016-09-08  8:30   ` Anshuman Khandual
2016-09-08 18:14     ` Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 05/10] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
2016-09-08 11:13   ` Kirill A. Shutemov
2016-09-08 17:22     ` Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 06/10] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 07/10] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
2016-09-08  9:00   ` Anshuman Khandual
2016-09-08 18:10     ` Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 08/10] mm, THP: Add can_split_huge_page() Huang, Ying
2016-09-08 11:17   ` Kirill A. Shutemov
2016-09-08 17:02     ` Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 09/10] mm, THP, swap: Support to split THP in swap cache Huang, Ying
2016-09-07 16:46 ` [PATCH -v3 10/10] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
2016-09-09  5:43 ` [PATCH -v3 00/10] THP swap: Delay splitting THP during swapping out Minchan Kim
2016-09-09 15:53   ` Tim Chen
2016-09-09 20:35   ` Huang, Ying
2016-09-13  6:13     ` Minchan Kim
2016-09-13  6:40       ` Huang, Ying
2016-09-13  7:05         ` Minchan Kim
2016-09-13  8:53           ` Huang, Ying
2016-09-13  9:16             ` Minchan Kim
2016-09-13 23:52               ` Chen, Tim C
2016-09-19  7:11                 ` Minchan Kim
2016-09-19 15:59                   ` Tim Chen
2016-09-18  1:53               ` Huang, Ying
2016-09-19  7:08                 ` Minchan Kim
2016-09-20  2:54                   ` Huang, Ying
2016-09-20  5:06                     ` Minchan Kim
2016-09-20  5:28                       ` Huang, Ying
2016-09-13 14:35             ` Andrea Arcangeli
2016-09-19 17:33 ` Hugh Dickins
2016-09-22 22:56 ` Shaohua Li
2016-09-22 23:49   ` Chen, Tim C
2016-09-22 23:53     ` Andi Kleen
2016-09-23  0:38   ` Rik van Riel
2016-09-23  2:32     ` Huang, Ying
2016-09-25 19:18       ` Shaohua Li
2016-09-26  1:06         ` Minchan Kim
2016-09-26  3:25         ` Huang, Ying
2016-09-23  2:12   ` Huang, Ying