[PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios

* [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
@ 2024-08-29 21:27 Kanchana P Sridhar
  2024-08-29 21:27 ` [PATCH v6 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Kanchana P Sridhar @ 2024-08-29 21:27 UTC
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.11-rc3 in patch 2/3 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
delete all offsets corresponding to a higher order folio stored in zswap.

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
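
As an illustration of how the counter could be consumed from userspace,
here is a minimal sketch. It assumes the counter file holds a single
decimal value; the helper names read_counter() and zswpout_path() are
hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal counter from a sysfs-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_counter(const char *path, unsigned long long *value)
{
	FILE *fp;
	int matched;

	assert(path && value);

	fp = fopen(path, "r");
	if (!fp)
		return -1;

	matched = fscanf(fp, "%llu", value);
	fclose(fp);

	return matched == 1 ? 0 : -1;
}

/*
 * Build the per-order zswpout path for an mTHP size given in kB.
 * The directory layout follows the path quoted above; this helper is
 * hypothetical. Returns 0 on success, -1 if the buffer is too small.
 */
static int zswpout_path(char *buf, size_t len, unsigned int size_kb)
{
	int n = snprintf(buf, len,
			 "/sys/kernel/mm/transparent_hugepage/hugepages-%ukB/stats/zswpout",
			 size_kb);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A caller would build the path for, say, 64 kB with zswpout_path() and
pass it to read_counter() before and after a workload to see how many
64K folios were stored by zswap in between.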

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. When disabled, zswap will
fall back to rejecting the mTHP folio, to be processed by the backing
swap device.

This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent RFC patch-series, with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
reviews and suggestions!

Changes since v5:
=================
1) Rebased to mm-unstable as of 8/29/2024,
   commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
   enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
   suggestion to add a knob by which users can enable/disable this
   change. Nhat, I hope this is along the lines of what you were
   thinking.
3) Added vm-scalability usemem data with 4K folios with
   CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
   there is no regression with this change.
4) Added data with usemem with 64K and 2M THP for an alternate view of
   before/after, as suggested by Yosry, so we can understand the impact
   of when mTHPs are split into 4K folios in shrink_folio_list()
   (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
   in zswap. Thanks Yosry for this suggestion.

Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
   Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
   commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
   CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
   robot; as per Nhat's and Michal's suggestion to not require a separate
   patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
   suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define new mthp zswpout stat counters and
   invoke count_mthp_stat() after successful zswap_store()s into a single
   commit. Thanks Yosry for this suggestion!

Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
   Thanks to Barry for suggesting aligning with Ryan Roberts' latest
   changes to count_mthp_stat() so that it's always defined, even when THP
   is disabled. Barry, I have also made one other change in page_io.c
   where count_mthp_stat() is called by count_swpout_vm_event(). I would
   appreciate it if you can review this. Thanks!
   Hopefully this should resolve the kernel robot build errors.

Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
   as suggested by Ying Huang. Ying, I would appreciate it if you can
   review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
   the kernel test robot build errors.
3) No code changes to the individual patches themselves.

Changes since RFC v1:
=====================

1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once per
     folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.

Regression Testing:
===================
I ran vm-scalability usemem with 70 processes without mTHP, i.e., only 4K
folios, with mm-unstable and with this patch-series. The main goal was
to make sure that there is no functional or performance regression
wrt the earlier zswap behavior for 4K folios when
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
pages goes through the newly added code path [zswap_store(),
zswap_store_page()].

The data indicates there is no regression.

 ------------------------------------------------------------------------------
                     mm-unstable 8-28-2024                        zswap-mTHP v6
                                              CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
                                                                     is not set
 ------------------------------------------------------------------------------
 ZSWAP compressor        zstd     deflate-                     zstd    deflate-
                                       iaa                                  iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)    110,775      113,010               111,550        121,937
 sys time (sec)      1,141.72       954.87              1,131.95         828.47
 memcg_high           140,500      153,737               139,772        134,129
 memcg_swap_high            0            0                     0              0
 memcg_swap_fail            0            0                     0              0
 pswpin                     0            0                     0              0
 pswpout                    0            0                     0              0
 zswpin                   675          690                   682            684
 zswpout            9,552,298   10,603,271             9,566,392      9,267,213
 thp_swpout                 0            0                     0              0
 thp_swpout_                0            0                     0              0
  fallback                                                                     
 pgmajfault             3,453        3,468                 3,841          3,487
 ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
 SWPOUT-64kB-mTHP           0            0                     0              0
 ------------------------------------------------------------------------------

Performance Testing:
====================
Testing of this patch-series was done with the v6.11-rc3 mainline, without
and with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket.

The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. There is no swap limit set for the cgroup. Following a
similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
series [2], 70 usemem processes were run, each allocating and writing 1G of
memory:

    usemem --init-time -w -O -n 70 1g

The vm/sysfs mTHP stats included with the performance data provide details
on the swapout activity to ZSWAP/swap.

Other kernel configuration parameters:

    ZSWAP Compressors : zstd, deflate-iaa
    ZSWAP Allocator   : zsmalloc
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence the "iaa_crypto" driver
internally decompresses each IAA compression, compares the CRCs returned
by the hardware, and reports an error on mismatch. Thus "deflate-iaa"
helps ensure better data integrity as compared to the software
compressors.

Throughput is derived by averaging the individual 70 processes' throughputs
reported by usemem. sys time is measured with perf. All data points are
averaged across 3 runs.

Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
===============================================================================

In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
in 64K/2M (m)THP being split, and only 4K folios being processed by zswap.

The "after" is CONFIG_THP_SWAP set to on, with this patch-series, which
results in 64K/2M (m)THP not being split, and being processed by zswap.

 64KB mTHP (cgroup memory.high set to 40G):
 ==========================================

 -------------------------------------------------------------------------------
                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
                                 Baseline                               Baseline
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
 sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
 memcg_high          124,183      127,513     138,651     133,884
 memcg_swap_high           0            0           0           0
 memcg_swap_fail     619,020      751,099           0           0
 pswpin                    0            0           0           0
 pswpout                   0            0           0           0
 zswpin                  656          569         624         639
 zswpout           9,413,603   11,284,812   9,453,761   9,385,910
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,470        3,382       4,633       3,611
 ZSWPOUT-64kB            n/a          n/a     590,768     586,521
 SWPOUT-64kB               0            0           0           0
 -------------------------------------------------------------------------------

 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
 =======================================================

 ------------------------------------------------------------------------------
                       v6.11-rc3 mainline              zswap-mTHP    Change wrt
                                 Baseline                              Baseline
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
 ------------------------------------------------------------------------------
 ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
                                     iaa                     iaa            iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
 sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
 memcg_high            14,628     16,247       14,951      16,096
 memcg_swap_high            0          0            0           0
 memcg_swap_fail       18,698     21,114            0           0
 pswpin                     0          0            0           0
 pswpout                    0          0            0           0
 zswpin                   663        665        5,333         781
 zswpout            8,419,458  8,992,065    8,546,895   9,355,760
 thp_swpout                 0          0            0           0
 thp_swpout_           18,697     21,113            0           0
  fallback
 pgmajfault             3,439      3,496        8,139       3,582
 ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
 SWPOUT-2048kB              0          0            0           0
 ------------------------------------------------------------------------------

We see improvements overall in throughput and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
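
For readers who want to re-derive the "Change wrt Baseline" columns, the
arithmetic can be sketched as below. The sign convention (positive means
improvement for both throughput and sys time) is inferred from the
tables, and the helper names are hypothetical.

```c
#include <assert.h>

/* Number of 4K pages in an mTHP of the given size (in kB). */
static unsigned long pages_per_folio(unsigned long folio_kb)
{
	assert(folio_kb >= 4 && folio_kb % 4 == 0);
	return folio_kb / 4;
}

/* Percent gain for a higher-is-better metric such as throughput. */
static double pct_gain(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Percent reduction for a lower-is-better metric such as sys time. */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the 64KB zstd numbers, pct_gain(136113, 140363) is about 3.1 and
pct_reduction(986.78, 954.85) is about 3.2, matching the 3% entries in
the table. The counters can be cross-checked the same way: 590,768
ZSWPOUT-64kB events times pages_per_folio(64) = 16 pages gives
9,452,288, which is close to the reported zswpout of 9,453,761.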

Case 2: Baseline with CONFIG_THP_SWAP enabled.
==============================================

In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
being stored by the backing swap device.

The "after" represents data with this patch-series, that results in 64K/2M
(m)THP being processed by zswap.

 64KB mTHP (cgroup memory.high set to 40G):
 ==========================================

 ------------------------------------------------------------------------------
                     v6.11-rc3 mainline              zswap-mTHP      Change wrt
                               Baseline                                Baseline
 ------------------------------------------------------------------------------
 ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
                                    iaa                     iaa             iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
 sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
 memcg_high          111,223    110,889     138,651     133,884
 memcg_swap_high           0          0           0           0
 memcg_swap_fail           0          0           0           0
 pswpin                   16         16           0           0
 pswpout           7,471,472  7,527,963           0           0
 zswpin                  635        605         624         639
 zswpout               1,509      1,478   9,453,761   9,385,910
 thp_swpout                0          0           0           0
 thp_swpout_               0          0           0           0
  fallback
 pgmajfault            3,616      3,430       4,633       3,611
 ZSWPOUT-64kB            n/a        n/a     590,768     586,521
 SWPOUT-64kB         466,967    470,498           0           0
 ------------------------------------------------------------------------------

 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
 =======================================================

 ------------------------------------------------------------------------------
                      v6.11-rc3 mainline              zswap-mTHP     Change wrt
                                Baseline                               Baseline
 ------------------------------------------------------------------------------
 ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
                                     iaa                     iaa            iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
 sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
 memcg_high            16,054     15,936      14,951      16,096
 memcg_swap_high            0          0           0           0
 memcg_swap_fail            0          0           0           0
 pswpin                     0          0           0           0
 pswpout            8,629,248  8,628,907           0           0
 zswpin                   560        645       5,333         781
 zswpout                1,416      1,503   8,546,895   9,355,760
 thp_swpout            16,854     16,853           0           0
 thp_swpout_                0          0           0           0
  fallback
 pgmajfault             3,341      3,574       8,139       3,582
 ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
 SWPOUT-2048kB         16,854     16,853           0           0
 ------------------------------------------------------------------------------

In the "Before" scenario, when zswap does not store mTHP, only allocations
count towards the cgroup memory limit. However, in the "After" scenario,
with the introduction of zswap_store() mTHP, both the allocations and
the zswap compressed pool usage from all 70 processes are counted towards
the memory limit. As a result, we see higher swapout activity in the
"After" data. Hence, more time is spent doing reclaim as the zswap cgroup
charge leads to more frequent memory.high breaches.

This causes degradation in throughput and sys time with zswap mTHP, more so
in case of zstd than deflate-iaa. Compress latency could play a part in
this - when there is more swapout activity happening, a slower compressor
would cause allocations to stall for any/all of the 70 processes.

In my opinion, even though the test setup does not provide an accurate
way for a direct before/after comparison (because of zswap usage being
counted in cgroup, hence towards the memory.high), it still seems
reasonable for zswap_store to support (m)THP, so that further performance
improvements can be implemented.

One of the ideas that has shown promise in our experiments is to improve
ZSWAP mTHP store performance using batching. With IAA compress/decompress
batching used in ZSWAP, we are able to demonstrate significant
performance improvements and memory savings with IAA in scalability
experiments, as compared to software compressors. We hope to submit
this work as subsequent RFCs.

I would greatly appreciate your code review comments and suggestions!

Thanks,
Kanchana

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/

Kanchana P Sridhar (3):
  mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
    stats.

 include/linux/huge_mm.h    |   1 +
 include/linux/memcontrol.h |   4 +
 mm/Kconfig                 |   8 ++
 mm/huge_memory.c           |   3 +
 mm/page_io.c               |   3 +-
 mm/zswap.c                 | 243 +++++++++++++++++++++++++++----------
 6 files changed, 200 insertions(+), 62 deletions(-)

base-commit: 9287e4adbc6ab8fa04d25eb82e097fed877a4642
-- 
2.27.0


* [PATCH v6 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  2024-08-29 21:27 [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
@ 2024-08-29 21:27 ` Kanchana P Sridhar
  2024-08-29 21:27 ` [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Kanchana P Sridhar @ 2024-08-29 21:27 UTC
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This resolves an issue with obj_cgroup_get() not being defined if
CONFIG_MEMCG is not defined.

Before this patch, we would see build errors if obj_cgroup_get() is
called from code that is agnostic of CONFIG_MEMCG.

The zswap_store() changes for mTHP in subsequent commits will require
the use of obj_cgroup_get() in zswap code that falls into this category.
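
The general pattern, an empty static inline stub for the compiled-out
configuration, can be sketched in a standalone form as below. All demo_*
names and the DEMO_HAVE_MEMCG switch are hypothetical stand-ins used
only for illustration; they are not kernel identifiers.

```c
#include <assert.h>

/* Hypothetical stand-in for a refcounted object such as struct obj_cgroup. */
struct demo_objcg {
	int refcnt;
};

#ifdef DEMO_HAVE_MEMCG	/* hypothetical switch, playing the role of CONFIG_MEMCG */
static inline void demo_objcg_get(struct demo_objcg *objcg)
{
	objcg->refcnt++;
}

static inline void demo_objcg_put(struct demo_objcg *objcg)
{
	assert(objcg->refcnt > 0);
	objcg->refcnt--;
}
#else
/* Feature compiled out: empty stubs keep callers free of #ifdef blocks. */
static inline void demo_objcg_get(struct demo_objcg *objcg)
{
	(void)objcg;
}

static inline void demo_objcg_put(struct demo_objcg *objcg)
{
	(void)objcg;
}
#endif

/* Caller that is agnostic of the switch, like the zswap store path. */
static void demo_store(struct demo_objcg *objcg)
{
	if (objcg)
		demo_objcg_get(objcg);
	/* ... the per-page store work would happen here ... */
	if (objcg)
		demo_objcg_put(objcg);
}
```

The caller builds in both configurations, and the get/put pair stays
balanced either way.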

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/memcontrol.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..15c2716f9aa3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
 	return NULL;
 }
 
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+}
+
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
 }
-- 
2.27.0




* [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-08-29 21:27 [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-08-29 21:27 ` [PATCH v6 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-08-29 21:27 ` Kanchana P Sridhar
  2024-08-29 23:06   ` Yosry Ahmed
                     ` (2 more replies)
  2024-08-29 21:27 ` [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 34+ messages in thread
From: Kanchana P Sridhar @ 2024-08-29 21:27 UTC
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

zswap_store() will now process and store mTHP and PMD-size THP folios.

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP.

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
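
The all-or-nothing behavior can be illustrated with a small userspace
model. This is only a sketch: the demo_* names are hypothetical, a flat
array of flags stands in for the xarray, and a failure is injected
through the fail_at parameter instead of coming from compression or
allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SLOTS 64	/* hypothetical size of the toy swap-offset table */

/* Toy stand-in for the xarray: one flag per swap offset. */
struct demo_tree {
	bool present[DEMO_SLOTS];
};

/* Remove every entry in [offset, offset + nr_pages). */
static void demo_delete_offsets(struct demo_tree *tree, size_t offset,
				size_t nr_pages)
{
	size_t i;

	for (i = 0; i < nr_pages; i++)
		tree->present[offset + i] = false;
}

/*
 * Store nr_pages consecutive pages starting at offset. The store of the
 * page at index fail_at is made to fail; pass a value >= nr_pages for
 * no failure. On failure, every offset of the folio is cleared so that
 * the folio is never left partially stored.
 */
static bool demo_store_folio(struct demo_tree *tree, size_t offset,
			     size_t nr_pages, size_t fail_at)
{
	size_t i;

	assert(offset + nr_pages <= DEMO_SLOTS);

	for (i = 0; i < nr_pages; i++) {
		if (i == fail_at) {
			demo_delete_offsets(tree, offset, nr_pages);
			return false;
		}
		tree->present[offset + i] = true;
	}

	return true;
}

/* Count the entries present in [offset, offset + nr_pages). */
static size_t demo_count(const struct demo_tree *tree, size_t offset,
			 size_t nr_pages)
{
	size_t i, n = 0;

	for (i = 0; i < nr_pages; i++)
		n += tree->present[offset + i] ? 1 : 0;

	return n;
}
```

Storing a 16-page folio with no injected failure leaves 16 entries
present, while a failure at any index leaves none for that folio and
does not disturb entries belonging to other folios.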

This forms the basis for building batching of pages during zswap store
of large folios, by compressing batches of up to say, 8 pages in an
mTHP in parallel in hardware, with the Intel In-Memory Analytics
Accelerator (Intel IAA).

Also, addressed some of the RFC comments from the discussion in [1].

Made a minor edit in the comments for "struct zswap_entry" to delete
the comments related to "value" since same-filled page handling has
been removed from zswap.

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/Kconfig |   8 ++
 mm/zswap.c | 243 +++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 190 insertions(+), 61 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index b23913d4e47e..68c7b01120bd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
 	  reducing the chance that cold pages will reside in the zswap pool
 	  and consume memory indefinitely.
 
+config ZSWAP_STORE_THP_DEFAULT_ON
+	bool "Store mTHP and THP folios in zswap"
+	depends on ZSWAP
+	default n
+	help
+	  If selected, zswap will process mTHP and THP folios by
+	  compressing and storing each 4K page in the large folio.
+
 choice
 	prompt "Default compressor"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 449914ea9919..3abf9810f0b7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/*
+ * Enable/disable zswap processing of mTHP folios.
+ * For now, only zswap_store will process mTHP folios.
+ */
+static bool zswap_mthp_enabled = IS_ENABLED(
+		CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
+module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -190,7 +198,6 @@ static struct shrinker *zswap_shrinker;
  *              section for context.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
  */
@@ -876,7 +883,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
@@ -894,7 +901,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
-	sg_set_folio(&input, folio, PAGE_SIZE, 0);
+	sg_set_page(&input, page, PAGE_SIZE, 0);
 
 	/*
 	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1404,35 +1411,82 @@ static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(struct xarray *tree,
+			      struct zswap_entry *entry)
 {
-	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
-	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *old;
-	struct obj_cgroup *objcg = NULL;
-	struct mem_cgroup *memcg = NULL;
+	struct zswap_entry *old;
+	pgoff_t offset = swp_offset(entry->swpentry);
 
-	VM_WARN_ON_ONCE(!folio_test_locked(folio));
-	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+	old = xa_store(tree, offset, entry, GFP_KERNEL);
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
 		return false;
+	}
 
-	if (!zswap_enabled)
-		goto check_old;
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
 
-	/* Check cgroup limits */
-	objcg = get_obj_cgroup_from_folio(folio);
-	if (objcg && !obj_cgroup_may_zswap(objcg)) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (shrink_memcg(memcg)) {
-			mem_cgroup_put(memcg);
-			goto reject;
-		}
-		mem_cgroup_put(memcg);
+	return true;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate the
+ * possibly stale entries which were previously stored at the offsets
+ * corresponding to each page of the folio. Otherwise, writeback could
+ * overwrite the new data in the swapfile.
+ *
+ * This is called after the store of the i-th offset in a large folio has
+ * failed. All zswap entries in the folio must be deleted. This helps make
+ * sure that a swapped-out mTHP is either entirely stored in zswap, or
+ * entirely not stored in zswap.
+ *
+ * This is also called if zswap_store() is invoked, but zswap is not enabled.
+ * All offsets for the folio are deleted from zswap in this case.
+ */
+static void zswap_delete_stored_offsets(struct xarray *tree,
+					pgoff_t offset,
+					long nr_pages)
+{
+	struct zswap_entry *entry;
+	long i;
+
+	for (i = 0; i < nr_pages; ++i) {
+		entry = xa_erase(tree, offset + i);
+		if (entry)
+			zswap_entry_free(entry);
 	}
+}
+
+/*
+ * Stores the page at specified "index" in a folio.
+ */
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
+{
+	swp_entry_t swp = folio->swap;
+	int type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp) + index;
+	struct page *page = folio_page(folio, index);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+
+	if (objcg)
+		obj_cgroup_get(objcg);
 
 	if (zswap_check_limits())
 		goto reject;
@@ -1445,42 +1499,20 @@ bool zswap_store(struct folio *folio)
 	}
 
 	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
+	if (!zswap_pool_get(pool))
 		goto freepage;
 
-	if (objcg) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
-			mem_cgroup_put(memcg);
-			goto put_pool;
-		}
-		mem_cgroup_put(memcg);
-	}
+	entry->pool = pool;
 
-	if (!zswap_compress(folio, entry))
+	if (!zswap_compress(page, entry))
 		goto put_pool;
 
-	entry->swpentry = swp;
+	entry->swpentry = swp_entry(type, offset);
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
-
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
+	if (!zswap_store_entry(tree, entry))
 		goto store_failed;
-	}
-
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, entry->length);
@@ -1511,23 +1543,112 @@ bool zswap_store(struct folio *folio)
 store_failed:
 	zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-	zswap_pool_put(entry->pool);
+	zswap_pool_put(pool);
 freepage:
 	zswap_entry_cache_free(entry);
 reject:
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+
+	return false;
+}
+
+/*
+ * Modified to store mTHP folios. Each page in the mTHP will be compressed
+ * and stored sequentially.
+ */
+bool zswap_store(struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	swp_entry_t swp = folio->swap;
+	pgoff_t offset = swp_offset(swp);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+	/* Storing large folios isn't enabled */
+	if (!zswap_mthp_enabled && folio_test_large(folio))
+		return false;
+
+	if (!zswap_enabled)
+		goto reject;
+
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. We do however check the zswap pool limits at the
+	 * start of zswap_store_page(). What this means is, the cgroup
+	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
+	 * However, the per-store-page zswap pool limits check should
+	 * hopefully trigger the cgroup aware and zswap LRU aware global
+	 * reclaim implemented in the shrinker. If this assumption holds,
+	 * the cgroup exceeding the zswap limits could potentially be
+	 * resolved before the next zswap_store, and if it is not, the next
+	 * zswap_store would fail the cgroup zswap limit check at the start.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
-	return false;
+	objcg = get_obj_cgroup_from_folio(folio);
+	if (objcg && !obj_cgroup_may_zswap(objcg)) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (shrink_memcg(memcg)) {
+			mem_cgroup_put(memcg);
+			goto put_objcg;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	if (zswap_check_limits())
+		goto put_objcg;
+
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			goto put_pool;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
+	 */
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
+	}
+
+	ret = true;
+
+put_pool:
+	zswap_pool_put(pool);
+put_objcg:
+	obj_cgroup_put(objcg);
+	if (zswap_pool_reached_full)
+		queue_work(shrink_wq, &zswap_shrink_work);
+reject:
+	/*
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
+	 */
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-08-29 21:27 ` [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
@ 2024-08-29 23:06   ` Yosry Ahmed
  2024-09-20  1:57     ` Sridhar, Kanchana P
  2024-09-02 11:37   ` Chengming Zhou
  2024-09-16  5:55   ` Barry Song
  2 siblings, 1 reply; 34+ messages in thread
From: Yosry Ahmed @ 2024-08-29 23:06 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou,
	wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>

I think "mm: zswap: support mTHP swapout in zswap_store()" is a better
subject. We usually use imperative tone for commit logs as much as
possible.

> zswap_store() will now process and store mTHP and PMD-size THP folios.
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP.
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios, by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Made a minor edit in the comments for "struct zswap_entry" to delete
> the comments related to "value" since same-filled page handling has
> been removed from zswap.

This commit log is not ordered clearly. Please start by describing
what we are doing, which is basically making zswap_store() support
large folios by compressing them page by page. Then mention important
implementation details and the tunable and config options added at the
end. After that, refer to the RFC that this is based on.

>
> Co-developed-by: Ryan Roberts
> Signed-off-by:

This is probably supposed to be "Signed-off-by: Ryan Roberts". I am
not sure what the policy is for reusing patches sent earlier on the
mailing list. Did you talk to Ryan about this?

> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

The diff is hard to follow because there is a lot of refactoring mixed
in with the functional changes. Could you please break this down into
purely refactoring patches doing the groundwork, and then the actual
functional change patch(es) on top of them?

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-08-29 23:06   ` Yosry Ahmed
@ 2024-09-20  1:57     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  1:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, August 29, 2024 4:06 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle
> mTHP folios.
> 
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> 
> I think "mm: zswap: support mTHP swapout in zswap_store()" is a better
> subject. We usually use imperative tone for commit logs as much as
> possible.

Sure, this is a much better subject, thanks! I will make this change in v7.

> 
> > zswap_store() will now process and store mTHP and PMD-size THP folios.
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP.
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios, by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Made a minor edit in the comments for "struct zswap_entry" to delete
> > the comments related to "value" since same-filled page handling has
> > been removed from zswap.
> 
> This commit log is not ordered clearly. Please start by describing
> what we are doing, which is basically making zswap_store() support
> large folios by compressing them page by page. Then mention important
> implementation details and the tunable and config options added at the
> end. After that, refer to the RFC that this is based on.

Thanks for these comments. Sure, I will incorporate them in v7.

> 
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> 
> This is probably supposed to be "Signed-off-by: Ryan Roberts". I am
> not sure what the policy is for reusing patches sent earlier on the
> mailing list. Did you talk to Ryan about this?

You're right, this is intended to be "Signed-off-by: Ryan Roberts" once
Ryan has had a chance to review and indicate approval of attribution
as co-author.

I have been following the documentation guidelines for submitting
patches, as pertaining to co-development. Ryan is in the recipients list
and I am hoping he can indicate his approval for the reuse of his original
RFC.

Ryan, I would greatly appreciate your inputs on the reuse of your RFC,
and also any code review comments for improving the patchset!

> 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> The diff is hard to follow because there is a lot of refactoring mixed
> in with the functional changes. Could you please break this down into
> purely refactoring patches doing the groundwork, and then the actual
> functional change patch(es) on top of them?

Sure, I will do this and submit a v7. Appreciate your comments!

Thanks,
Kanchana

> 
> Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-08-29 21:27 ` [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
  2024-08-29 23:06   ` Yosry Ahmed
@ 2024-09-02 11:37   ` Chengming Zhou
  2024-09-20  2:43     ` Sridhar, Kanchana P
  2024-09-16  5:55   ` Barry Song
  2 siblings, 1 reply; 34+ messages in thread
From: Chengming Zhou @ 2024-09-02 11:37 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal

On 2024/8/30 05:27, Kanchana P Sridhar wrote:
> zswap_store() will now process and store mTHP and PMD-size THP folios.
> 
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP.
> 
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
> 
>    "[RFC,v1] mm: zswap: Store large folios without splitting"
> 
>    [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> 
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
> 
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
> 
> Each page's swap offset is stored as a separate zswap entry.
> 
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> 
> This forms the basis for building batching of pages during zswap store
> of large folios, by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
> 
> Also, addressed some of the RFC comments from the discussion in [1].
> 
> Made a minor edit in the comments for "struct zswap_entry" to delete
> the comments related to "value" since same-filled page handling has
> been removed from zswap.
> 
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

The code looks ok, but I also find this patch is a little hard to 
review, maybe it's better to split it into small patches as Yosry suggested.

Thanks!

[...]
> +
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
> +bool zswap_store(struct folio *folio)
> +{
> +	long nr_pages = folio_nr_pages(folio);
> +	swp_entry_t swp = folio->swap;
> +	pgoff_t offset = swp_offset(swp);
> +	struct xarray *tree = swap_zswap_tree(swp);
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg = NULL;
> +	struct zswap_pool *pool;
> +	bool ret = false;
> +	long index;
> +
> +	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +	/* Storing large folios isn't enabled */
> +	if (!zswap_mthp_enabled && folio_test_large(folio))
> +		return false;
> +
> +	if (!zswap_enabled)
> +		goto reject;
> +
>   	/*
> -	 * If the zswap store fails or zswap is disabled, we must invalidate the
> -	 * possibly stale entry which was previously stored at this offset.
> -	 * Otherwise, writeback could overwrite the new data in the swapfile.
> +	 * Check cgroup limits:
> +	 *
> +	 * The cgroup zswap limit check is done once at the beginning of an
> +	 * mTHP store, and not within zswap_store_page() for each page
> +	 * in the mTHP. We do however check the zswap pool limits at the
> +	 * start of zswap_store_page(). What this means is, the cgroup
> +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +	 * However, the per-store-page zswap pool limits check should
> +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> +	 * reclaim implemented in the shrinker. If this assumption holds,
> +	 * the cgroup exceeding the zswap limits could potentially be
> +	 * resolved before the next zswap_store, and if it is not, the next
> +	 * zswap_store would fail the cgroup zswap limit check at the start.
>   	 */
> -	entry = xa_erase(tree, offset);
> -	if (entry)
> -		zswap_entry_free(entry);
> -	return false;
> +	objcg = get_obj_cgroup_from_folio(folio);
> +	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (shrink_memcg(memcg)) {
> +			mem_cgroup_put(memcg);
> +			goto put_objcg;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	if (zswap_check_limits())
> +		goto put_objcg;
> +
> +	pool = zswap_pool_current_get();
> +	if (!pool)
> +		goto put_objcg;
> +
> +	if (objcg) {
> +		memcg = get_mem_cgroup_from_objcg(objcg);
> +		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +			mem_cgroup_put(memcg);
> +			goto put_pool;
> +		}
> +		mem_cgroup_put(memcg);
> +	}
> +
> +	/*
> +	 * Store each page of the folio as a separate entry. If we fail to store
> +	 * a page, unwind by removing all the previous pages we stored.
> +	 */
> +	for (index = 0; index < nr_pages; ++index) {
> +		if (!zswap_store_page(folio, index, objcg, pool))
> +			goto put_pool;
> +	}
> +
> +	ret = true;
> +
> +put_pool:
> +	zswap_pool_put(pool);
> +put_objcg:
> +	obj_cgroup_put(objcg);
> +	if (zswap_pool_reached_full)
> +		queue_work(shrink_wq, &zswap_shrink_work);
> +reject:
> +	/*
> +	 * If the zswap store fails or zswap is disabled, we must invalidate
> +	 * the possibly stale entries which were previously stored at the
> +	 * offsets corresponding to each page of the folio. Otherwise,
> +	 * writeback could overwrite the new data in the swapfile.
> +	 */
> +	if (!ret)
> +		zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +	return ret;
>   }
>   
>   bool zswap_load(struct folio *folio)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-09-02 11:37   ` Chengming Zhou
@ 2024-09-20  2:43     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:43 UTC (permalink / raw)
  To: Chengming Zhou, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Sridhar, Kanchana P
  Cc: Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

Hi Chengming,

> -----Original Message-----
> From: Chengming Zhou <chengming.zhou@linux.dev>
> Sent: Monday, September 2, 2024 4:38 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org
> Cc: Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle
> mTHP folios.
> 
> On 2024/8/30 05:27, Kanchana P Sridhar wrote:
> > zswap_store() will now process and store mTHP and PMD-size THP folios.
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP.
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >    "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >    [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios, by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Made a minor edit in the comments for "struct zswap_entry" to delete
> > the comments related to "value" since same-filled page handling has
> > been removed from zswap.
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> The code looks ok, but I also find this patch is a little hard to
> review, maybe it's better to split it into small patches as Yosry suggested.

Definitely, will do so and submit a v7.

Thanks,
Kanchana

> 
> Thanks!
> 
> [...]
> > +
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> > + * and stored sequentially.
> > + */
> > +bool zswap_store(struct folio *folio)
> > +{
> > +	long nr_pages = folio_nr_pages(folio);
> > +	swp_entry_t swp = folio->swap;
> > +	pgoff_t offset = swp_offset(swp);
> > +	struct xarray *tree = swap_zswap_tree(swp);
> > +	struct obj_cgroup *objcg = NULL;
> > +	struct mem_cgroup *memcg = NULL;
> > +	struct zswap_pool *pool;
> > +	bool ret = false;
> > +	long index;
> > +
> > +	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +	/* Storing large folios isn't enabled */
> > +	if (!zswap_mthp_enabled && folio_test_large(folio))
> > +		return false;
> > +
> > +	if (!zswap_enabled)
> > +		goto reject;
> > +
> >   	/*
> > -	 * If the zswap store fails or zswap is disabled, we must invalidate the
> > -	 * possibly stale entry which was previously stored at this offset.
> > -	 * Otherwise, writeback could overwrite the new data in the swapfile.
> > +	 * Check cgroup limits:
> > +	 *
> > +	 * The cgroup zswap limit check is done once at the beginning of an
> > +	 * mTHP store, and not within zswap_store_page() for each page
> > +	 * in the mTHP. We do however check the zswap pool limits at the
> > +	 * start of zswap_store_page(). What this means is, the cgroup
> > +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +	 * However, the per-store-page zswap pool limits check should
> > +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> > +	 * reclaim implemented in the shrinker. If this assumption holds,
> > +	 * the cgroup exceeding the zswap limits could potentially be
> > +	 * resolved before the next zswap_store, and if it is not, the next
> > +	 * zswap_store would fail the cgroup zswap limit check at the start.
> >   	 */
> > -	entry = xa_erase(tree, offset);
> > -	if (entry)
> > -		zswap_entry_free(entry);
> > -	return false;
> > +	objcg = get_obj_cgroup_from_folio(folio);
> > +	if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +		memcg = get_mem_cgroup_from_objcg(objcg);
> > +		if (shrink_memcg(memcg)) {
> > +			mem_cgroup_put(memcg);
> > +			goto put_objcg;
> > +		}
> > +		mem_cgroup_put(memcg);
> > +	}
> > +
> > +	if (zswap_check_limits())
> > +		goto put_objcg;
> > +
> > +	pool = zswap_pool_current_get();
> > +	if (!pool)
> > +		goto put_objcg;
> > +
> > +	if (objcg) {
> > +		memcg = get_mem_cgroup_from_objcg(objcg);
> > +		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > +			mem_cgroup_put(memcg);
> > +			goto put_pool;
> > +		}
> > +		mem_cgroup_put(memcg);
> > +	}
> > +
> > +	/*
> > +	 * Store each page of the folio as a separate entry. If we fail to store
> > +	 * a page, unwind by removing all the previous pages we stored.
> > +	 */
> > +	for (index = 0; index < nr_pages; ++index) {
> > +		if (!zswap_store_page(folio, index, objcg, pool))
> > +			goto put_pool;
> > +	}
> > +
> > +	ret = true;
> > +
> > +put_pool:
> > +	zswap_pool_put(pool);
> > +put_objcg:
> > +	obj_cgroup_put(objcg);
> > +	if (zswap_pool_reached_full)
> > +		queue_work(shrink_wq, &zswap_shrink_work);
> > +reject:
> > +	/*
> > +	 * If the zswap store fails or zswap is disabled, we must invalidate
> > +	 * the possibly stale entries which were previously stored at the
> > +	 * offsets corresponding to each page of the folio. Otherwise,
> > +	 * writeback could overwrite the new data in the swapfile.
> > +	 */
> > +	if (!ret)
> > +		zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +	return ret;
> >   }
> >
> >   bool zswap_load(struct folio *folio)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-08-29 21:27 ` [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
  2024-08-29 23:06   ` Yosry Ahmed
  2024-09-02 11:37   ` Chengming Zhou
@ 2024-09-16  5:55   ` Barry Song
  2024-09-20 20:53     ` Sridhar, Kanchana P
  2 siblings, 1 reply; 34+ messages in thread
From: Barry Song @ 2024-09-16  5:55 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, akpm,
	nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Fri, Aug 30, 2024 at 5:27 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now process and store mTHP and PMD-size THP folios.
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP.
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios, by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).

Hi Kanchana,
I'm not opposed to this patch, but I don't understand how iterating
through each page within an mTHP supports the use of Intel IAA,
as it involves compressing pages individually.

In the document 'by_n compression and decompression with Intel IAA' by
Andre Glover
(https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com),
it appears that zsmalloc/zram needs to support multi-page compression
and decompression to fully leverage the hardware's capabilities. Could
you clarify how this approach fits in?

In patch2/3 of that series, you have:
"Add the 'by_n' attribute to the acomp_req. The 'by_n' attribute can be
used a directive by acomp crypto algorithms for splitting compress and
decompress operations into "n" separate jobs."

How can you apply 'by_n' to a single page rather than to a large folio?
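
For context, the sequential scheme that the commit message describes
(one entry per page, and a full unwind if any page fails) can be modeled
in user space roughly as below. This is only a sketch under stated
assumptions: store_folio(), store_page(), delete_stored_offsets(),
count_stored(), the fixed-size "tree" array and the "fail_at"
failure-injection knob are hypothetical stand-ins for zswap_store(),
zswap_store_page(), zswap_delete_stored_offsets() and the xarray. It
performs no compression and is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_OFFSETS 64

/* Stand-in for the zswap xarray: one slot per swap offset. */
static int *tree[NR_OFFSETS];

/* Hypothetical failure injection: storing this page index fails. */
static long fail_at = -1;

/* Models zswap_store_page(): store one page, replacing any stale entry. */
static bool store_page(long offset, long index)
{
	int *entry;

	if (index == fail_at)
		return false;

	entry = malloc(sizeof(*entry));
	if (!entry)
		return false;
	*entry = (int)index;

	free(tree[offset + index]);	/* drop a stale entry, if any */
	tree[offset + index] = entry;
	return true;
}

/* Models zswap_delete_stored_offsets(): erase every offset of the folio. */
static void delete_stored_offsets(long offset, long nr_pages)
{
	for (long i = 0; i < nr_pages; ++i) {
		free(tree[offset + i]);
		tree[offset + i] = NULL;
	}
}

/* Models the store loop: either all nr_pages are stored, or none remain. */
static bool store_folio(long offset, long nr_pages)
{
	bool ret = false;

	assert(offset >= 0 && offset + nr_pages <= NR_OFFSETS);

	for (long index = 0; index < nr_pages; ++index) {
		if (!store_page(offset, index))
			goto out;
	}
	ret = true;
out:
	/* On failure, invalidate new and stale entries alike. */
	if (!ret)
		delete_stored_offsets(offset, nr_pages);
	return ret;
}

/* Counts how many offsets of the folio currently hold an entry. */
static long count_stored(long offset, long nr_pages)
{
	long n = 0;

	for (long i = 0; i < nr_pages; ++i)
		n += tree[offset + i] != NULL;
	return n;
}
```

In this model, a call such as store_folio(0, 8) with fail_at set to 5
leaves no entries at offsets 0..7, including any stale ones left by an
earlier successful store. Each page is still handled one at a time, so
the sketch illustrates only the all-or-nothing bookkeeping, not any
batching.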

>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Made a minor edit in the comments for "struct zswap_entry" to delete
> the comments related to "value" since same-filled page handling has
> been removed from zswap.
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++
>  mm/zswap.c | 243 +++++++++++++++++++++++++++++++++++++++--------------
>  2 files changed, 190 insertions(+), 61 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b23913d4e47e..68c7b01120bd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 449914ea9919..3abf9810f0b7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +
>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -190,7 +198,6 @@ static struct shrinker *zswap_shrinker;
>   *              section for context.
>   * pool - the zswap_pool the entry's data is in
>   * handle - zpool allocation handle that stores the compressed page data
> - * value - value of the same-value filled pages which have same content
>   * objcg - the obj_cgroup that the compressed memory is charged to
>   * lru - handle to the pool's lru used to evict pages.
>   */
> @@ -876,7 +883,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>         return 0;
>  }
>
> -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> +static bool zswap_compress(struct page *page, struct zswap_entry *entry)
>  {
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct scatterlist input, output;
> @@ -894,7 +901,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>
>         dst = acomp_ctx->buffer;
>         sg_init_table(&input, 1);
> -       sg_set_folio(&input, folio, PAGE_SIZE, 0);
> +       sg_set_page(&input, page, PAGE_SIZE, 0);
>
>         /*
>          * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> @@ -1404,35 +1411,82 @@ static void shrink_worker(struct work_struct *w)
>  /*********************************
>  * main API
>  **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Returns true if the entry was successfully
> + * stored in the xarray, and false otherwise.
> + */
> +static bool zswap_store_entry(struct xarray *tree,
> +                             struct zswap_entry *entry)
>  {
> -       swp_entry_t swp = folio->swap;
> -       pgoff_t offset = swp_offset(swp);
> -       struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry, *old;
> -       struct obj_cgroup *objcg = NULL;
> -       struct mem_cgroup *memcg = NULL;
> +       struct zswap_entry *old;
> +       pgoff_t offset = swp_offset(entry->swpentry);
>
> -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +       old = xa_store(tree, offset, entry, GFP_KERNEL);
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       if (xa_is_err(old)) {
> +               int err = xa_err(old);
> +
> +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +               zswap_reject_alloc_fail++;
>                 return false;
> +       }
>
> -       if (!zswap_enabled)
> -               goto check_old;
> +       /*
> +        * We may have had an existing entry that became stale when
> +        * the folio was redirtied and now the new version is being
> +        * swapped out. Get rid of the old.
> +        */
> +       if (old)
> +               zswap_entry_free(old);
>
> -       /* Check cgroup limits */
> -       objcg = get_obj_cgroup_from_folio(folio);
> -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (shrink_memcg(memcg)) {
> -                       mem_cgroup_put(memcg);
> -                       goto reject;
> -               }
> -               mem_cgroup_put(memcg);
> +       return true;
> +}
> +
> +/*
> + * If the zswap store fails or zswap is disabled, we must invalidate the
> + * possibly stale entries which were previously stored at the offsets
> + * corresponding to each page of the folio. Otherwise, writeback could
> + * overwrite the new data in the swapfile.
> + *
> + * This is called after the store of the i-th offset in a large folio has
> + * failed. All zswap entries in the folio must be deleted. This helps make
> + * sure that a swapped-out mTHP is either entirely stored in zswap, or
> + * entirely not stored in zswap.
> + *
> + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> + * All offsets for the folio are deleted from zswap in this case.
> + */
> +static void zswap_delete_stored_offsets(struct xarray *tree,
> +                                       pgoff_t offset,
> +                                       long nr_pages)
> +{
> +       struct zswap_entry *entry;
> +       long i;
> +
> +       for (i = 0; i < nr_pages; ++i) {
> +               entry = xa_erase(tree, offset + i);
> +               if (entry)
> +                       zswap_entry_free(entry);
>         }
> +}
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + */
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)
> +{
> +       swp_entry_t swp = folio->swap;
> +       int type = swp_type(swp);
> +       pgoff_t offset = swp_offset(swp) + index;
> +       struct page *page = folio_page(folio, index);
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct zswap_entry *entry;
> +
> +       if (objcg)
> +               obj_cgroup_get(objcg);
>
>         if (zswap_check_limits())
>                 goto reject;
> @@ -1445,42 +1499,20 @@ bool zswap_store(struct folio *folio)
>         }
>
>         /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> +       if (!zswap_pool_get(pool))
>                 goto freepage;
>
> -       if (objcg) {
> -               memcg = get_mem_cgroup_from_objcg(objcg);
> -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> -                       mem_cgroup_put(memcg);
> -                       goto put_pool;
> -               }
> -               mem_cgroup_put(memcg);
> -       }
> +       entry->pool = pool;
>
> -       if (!zswap_compress(folio, entry))
> +       if (!zswap_compress(page, entry))
>                 goto put_pool;
>
> -       entry->swpentry = swp;
> +       entry->swpentry = swp_entry(type, offset);
>         entry->objcg = objcg;
>         entry->referenced = true;
>
> -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> -       if (xa_is_err(old)) {
> -               int err = xa_err(old);
> -
> -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -               zswap_reject_alloc_fail++;
> +       if (!zswap_store_entry(tree, entry))
>                 goto store_failed;
> -       }
> -
> -       /*
> -        * We may have had an existing entry that became stale when
> -        * the folio was redirtied and now the new version is being
> -        * swapped out. Get rid of the old.
> -        */
> -       if (old)
> -               zswap_entry_free(old);
>
>         if (objcg) {
>                 obj_cgroup_charge_zswap(objcg, entry->length);
> @@ -1511,23 +1543,112 @@ bool zswap_store(struct folio *folio)
>  store_failed:
>         zpool_free(entry->pool->zpool, entry->handle);
>  put_pool:
> -       zswap_pool_put(entry->pool);
> +       zswap_pool_put(pool);
>  freepage:
>         zswap_entry_cache_free(entry);
>  reject:
>         obj_cgroup_put(objcg);
>         if (zswap_pool_reached_full)
>                 queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +
> +       return false;
> +}
> +
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
> +bool zswap_store(struct folio *folio)
> +{
> +       long nr_pages = folio_nr_pages(folio);
> +       swp_entry_t swp = folio->swap;
> +       pgoff_t offset = swp_offset(swp);
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct obj_cgroup *objcg = NULL;
> +       struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
> +
> +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +       /* Storing large folios isn't enabled */
> +       if (!zswap_mthp_enabled && folio_test_large(folio))
> +               return false;
> +
> +       if (!zswap_enabled)
> +               goto reject;
> +
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * Check cgroup limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of an
> +        * mTHP store, and not within zswap_store_page() for each page
> +        * in the mTHP. We do however check the zswap pool limits at the
> +        * start of zswap_store_page(). What this means is, the cgroup
> +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +        * However, the per-store-page zswap pool limits check should
> +        * hopefully trigger the cgroup aware and zswap LRU aware global
> +        * reclaim implemented in the shrinker. If this assumption holds,
> +        * the cgroup exceeding the zswap limits could potentially be
> +        * resolved before the next zswap_store, and if it is not, the next
> +        * zswap_store would fail the cgroup zswap limit check at the start.
>          */
> -       entry = xa_erase(tree, offset);
> -       if (entry)
> -               zswap_entry_free(entry);
> -       return false;
> +       objcg = get_obj_cgroup_from_folio(folio);
> +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (shrink_memcg(memcg)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_objcg;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       if (zswap_check_limits())
> +               goto put_objcg;
> +
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
> +
> +       if (objcg) {
> +               memcg = get_mem_cgroup_from_objcg(objcg);
> +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +                       mem_cgroup_put(memcg);
> +                       goto put_pool;
> +               }
> +               mem_cgroup_put(memcg);
> +       }
> +
> +       /*
> +        * Store each page of the folio as a separate entry. If we fail to store
> +        * a page, unwind by removing all the previous pages we stored.
> +        */
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio, index, objcg, pool))
> +                       goto put_pool;
> +       }
> +
> +       ret = true;
> +
> +put_pool:
> +       zswap_pool_put(pool);
> +put_objcg:
> +       obj_cgroup_put(objcg);
> +       if (zswap_pool_reached_full)
> +               queue_work(shrink_wq, &zswap_shrink_work);
> +reject:
> +       /*
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
> +        */
> +       if (!ret)
> +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +       return ret;
>  }
>
>  bool zswap_load(struct folio *folio)
> --
> 2.27.0
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-09-16  5:55   ` Barry Song
@ 2024-09-20 20:53     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 20:53 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, Huang, Ying, akpm@linux-foundation.org,
	Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Barry,

> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Sunday, September 15, 2024 10:55 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle
> mTHP folios.
> 
> On Fri, Aug 30, 2024 at 5:27 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now process and store mTHP and PMD-size THP folios.
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP.
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios, by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> 
> Hi Kanchana,
> I'm not opposed to this patch, but I don't understand how iterating
> through each page within an mTHP supports the use of Intel IAA,
> as it involves compressing pages individually.

Thanks for your insightful comments and questions!

With Intel IAA, we have the opportunity to make use of compression
and decompression engines in hardware to do parallel compressions during
swapout and parallel decompressions during swapin with readahead
(and eventually mTHP swapin of larger compressed buffers when this is
ready). If compressions can be parallelized, we can improve reclaim
performance. If decompressions can be parallelized, we can improve
do_swap_page() performance.

We have implemented compress batching within mTHP folios during
zswap store, as well as compress batching of any-order folios during
shrink_folio_list() -- swap_writepage() using a plug mechanism, similar
to the existing swap_write_unplug() implementation. Initially, our
solution works at the granularity of compressing PAGE_SIZE pages within
(many) folios in parallel, to maximize throughput with IAA and minimize
latency per folio store.

With IAA, we are able to submit a batch of compress/decompress jobs
and poll for their completion asynchronously (RFCs yet to be submitted).
This brings the benefit of parallel compression/decompression in hardware
without waiting for the jobs to complete synchronously. With zswap_store
batching within an mTHP folio, the "batch" consists of up to, say, 8 pages
in the mTHP. As mentioned above, we have extended this to construct batches
of any-order (m)THP folios during reclaim, which can be processed by zswap_store
compress batching.
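
To make the batch construction concrete, here is a minimal userspace sketch
(not code from this series or from the pending RFCs) of how the pages of one
mTHP could be partitioned into batches of at most 8 pages. The names
split_into_batches(), struct page_batch and BATCH_MAX_PAGES are hypothetical,
and the batch size of 8 is only the "say, 8" figure mentioned above.

```c
#include <assert.h>

/* Hypothetical batch size; the discussion only says "up to say, 8 pages". */
#define BATCH_MAX_PAGES 8

/* One batch covers pages [first, first + count) of the folio. */
struct page_batch {
	long first;
	long count;
};

/*
 * Split nr_pages into consecutive batches of at most BATCH_MAX_PAGES.
 * Fills at most max_batches entries of out[] and returns the number of
 * batches needed to cover the whole folio.
 */
static long split_into_batches(long nr_pages, struct page_batch *out,
			       long max_batches)
{
	long nr_batches = 0;
	long first;

	assert(nr_pages >= 0);

	for (first = 0; first < nr_pages; first += BATCH_MAX_PAGES) {
		long left = nr_pages - first;
		long count = left < BATCH_MAX_PAGES ? left : BATCH_MAX_PAGES;

		if (nr_batches < max_batches) {
			out[nr_batches].first = first;
			out[nr_batches].count = count;
		}
		nr_batches++;
	}

	return nr_batches;
}
```

For a 64K mTHP with 4K pages (16 pages) this gives two batches of 8 pages;
a 13-page range gives one batch of 8 followed by one of 5.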

We have also implemented decompress batching of 4K folios to improve
do_swap_page() performance using parallel decompression of a batch
of 4k folios. Using swapin_readahead(), we can prefetch a batch of 4k folios
in the kernel today. Decompress batching involves zswap_load of this
batch using parallel decompressions in IAA.

To utilize IAA compress/decompress engines, we have developed the
respective batching interfaces from shrink_folio_list() and from
swapin_readahead(). Our experiments in multi-instance, highly contended
server scenarios under memory pressure have demonstrated significant
kernel swapout/swapin latency improvements, workload-level performance
improvements, and overall system-level memory savings as compared to
software compressors.

Needless to say, batching only improves performance in configurations
with Intel IAA, and it should not impact software compressors. 

> 
> In the document 'by_n compression and decompression with Intel IAA' by
> Andre Glover
> (https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com),
> it appears that zsmalloc/zram needs to support multi-page compression and
> decompression to fully leverage the hardware's capabilities. Could you
> clarify how this approach fits in?

We are also staying tuned in to the mTHP swapin progress being made by
yourself, Chuanhua and others. Our goal is to eventually be able to
swapout/swapin an mTHP as a single entity. In this case also, IAA byN
can compress/decompress a tunable number of chunks of an mTHP in
parallel [1].

As in earlier discussions, the IAA byN approach is dependent on the mTHP
swapin patchset [2] and associated zsmalloc/zram updates for storing larger
compressed buffers from ZRAM [3]. However, this will only address ZRAM.
IMHO, this could be a more involved effort for ZSWAP, which would need
mTHP swapin to be more generally applicable.

In the meantime, the IAA batching approach provides us a way to work with
the existing kernel support for mTHP swapout/swapin as pertaining to zswap.

[1] https://patchwork.kernel.org/project/linux-mm/cover/cover.1714581792.git.andre.glover@linux.intel.com/
[2] https://patchwork.kernel.org/project/linux-mm/cover/20240908232119.2157-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

> 
> In patch2/3 of that series, you have:
> "Add the 'by_n' attribute to the acomp_req. The 'by_n' attribute can be
> used as a directive by acomp crypto algorithms for splitting compress and
> decompress operations into "n" separate jobs."
> 
> How can you apply 'by_n' to a single page rather than to a large folio?

In [1], Andre had introduced IAA byN as a new 'canned-by_n' algorithm.
In theory, it should be possible to apply this to input buffers of any size,
although most of our testing and data posted in [1] focused on using 64k mTHP
swapout/swapin with zram and your initial patchsets for [2-a] and [3].

[1] https://patchwork.kernel.org/project/linux-mm/cover/cover.1714581792.git.andre.glover@linux.intel.com/
[2-a] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
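
As a rough illustration of what "splitting into n separate jobs" could mean
for a buffer of arbitrary size, the userspace sketch below divides a buffer
into n contiguous chunks. This is only an assumption for the sake of
discussion: how 'canned-by_n' actually partitions its input is not described
in this thread, and split_by_n() and struct chunk are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One job's slice of the input buffer: bytes [off, off + len). */
struct chunk {
	size_t off;
	size_t len;
};

/*
 * Fill out[0..n-1] with n contiguous chunks that together cover
 * [0, len). When len is not a multiple of n, the first (len % n)
 * chunks are one byte longer. Returns 0 on success, or -1 if n is 0
 * or larger than len (a chunk would be empty).
 */
static int split_by_n(size_t len, unsigned int n, struct chunk *out)
{
	size_t base, rem, off = 0;
	unsigned int i;

	if (n == 0 || n > len)
		return -1;

	base = len / n;
	rem = len % n;

	for (i = 0; i < n; i++) {
		out[i].off = off;
		out[i].len = base + (i < rem ? 1 : 0);
		off += out[i].len;
	}

	/* The chunks must tile the buffer exactly. */
	assert(off == len);
	return 0;
}
```

With n = 4, a single 4096-byte page would be split into four 1024-byte
chunks, and a 64k buffer into four 16k chunks.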

Thanks,
Kanchana

> 
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Made a minor edit in the comments for "struct zswap_entry" to delete
> > the comments related to "value" since same-filled page handling has
> > been removed from zswap.
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++
> >  mm/zswap.c | 243 +++++++++++++++++++++++++++++++++++++++--------------
> >  2 files changed, 190 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index b23913d4e47e..68c7b01120bd 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 449914ea9919..3abf9810f0b7 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -190,7 +198,6 @@ static struct shrinker *zswap_shrinker;
> >   *              section for context.
> >   * pool - the zswap_pool the entry's data is in
> >   * handle - zpool allocation handle that stores the compressed page data
> > - * value - value of the same-value filled pages which have same content
> >   * objcg - the obj_cgroup that the compressed memory is charged to
> >   * lru - handle to the pool's lru used to evict pages.
> >   */
> > @@ -876,7 +883,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> >         return 0;
> >  }
> >
> > -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> > +static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> >  {
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         struct scatterlist input, output;
> > @@ -894,7 +901,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> >
> >         dst = acomp_ctx->buffer;
> >         sg_init_table(&input, 1);
> > -       sg_set_folio(&input, folio, PAGE_SIZE, 0);
> > +       sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> >         /*
> >          * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> > @@ -1404,35 +1411,82 @@ static void shrink_worker(struct work_struct *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(struct xarray *tree,
> > +                             struct zswap_entry *entry)
> >  {
> > -       swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> > -       struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_entry *old;
> > +       pgoff_t offset = swp_offset(entry->swpentry);
> >
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +       old = xa_store(tree, offset, entry, GFP_KERNEL);
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       if (xa_is_err(old)) {
> > +               int err = xa_err(old);
> > +
> > +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > +               zswap_reject_alloc_fail++;
> >                 return false;
> > +       }
> >
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > +       /*
> > +        * We may have had an existing entry that became stale when
> > +        * the folio was redirtied and now the new version is being
> > +        * swapped out. Get rid of the old.
> > +        */
> > +       if (old)
> > +               zswap_entry_free(old);
> >
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > +       return true;
> > +}
> > +
> > +/*
> > + * If the zswap store fails or zswap is disabled, we must invalidate the
> > + * possibly stale entries which were previously stored at the offsets
> > + * corresponding to each page of the folio. Otherwise, writeback could
> > + * overwrite the new data in the swapfile.
> > + *
> > + * This is called after the store of the i-th offset in a large folio has
> > + * failed. All zswap entries in the folio must be deleted. This helps make
> > + * sure that a swapped-out mTHP is either entirely stored in zswap, or
> > + * entirely not stored in zswap.
> > + *
> > + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> > + * All offsets for the folio are deleted from zswap in this case.
> > + */
> > +static void zswap_delete_stored_offsets(struct xarray *tree,
> > +                                       pgoff_t offset,
> > +                                       long nr_pages)
> > +{
> > +       struct zswap_entry *entry;
> > +       long i;
> > +
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               entry = xa_erase(tree, offset + i);
> > +               if (entry)
> > +                       zswap_entry_free(entry);
> >         }
> > +}
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + */
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> > +{
> > +       swp_entry_t swp = folio->swap;
> > +       int type = swp_type(swp);
> > +       pgoff_t offset = swp_offset(swp) + index;
> > +       struct page *page = folio_page(folio, index);
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct zswap_entry *entry;
> > +
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> >
> >         if (zswap_check_limits())
> >                 goto reject;
> > @@ -1445,42 +1499,20 @@ bool zswap_store(struct folio *folio)
> >         }
> >
> >         /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > +       if (!zswap_pool_get(pool))
> >                 goto freepage;
> >
> > -       if (objcg) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto put_pool;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > +       entry->pool = pool;
> >
> > -       if (!zswap_compress(folio, entry))
> > +       if (!zswap_compress(page, entry))
> >                 goto put_pool;
> >
> > -       entry->swpentry = swp;
> > +       entry->swpentry = swp_entry(type, offset);
> >         entry->objcg = objcg;
> >         entry->referenced = true;
> >
> > -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> > -       if (xa_is_err(old)) {
> > -               int err = xa_err(old);
> > -
> > -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > -               zswap_reject_alloc_fail++;
> > +       if (!zswap_store_entry(tree, entry))
> >                 goto store_failed;
> > -       }
> > -
> > -       /*
> > -        * We may have had an existing entry that became stale when
> > -        * the folio was redirtied and now the new version is being
> > -        * swapped out. Get rid of the old.
> > -        */
> > -       if (old)
> > -               zswap_entry_free(old);
> >
> >         if (objcg) {
> >                 obj_cgroup_charge_zswap(objcg, entry->length);
> > @@ -1511,23 +1543,112 @@ bool zswap_store(struct folio *folio)
> >  store_failed:
> >         zpool_free(entry->pool->zpool, entry->handle);
> >  put_pool:
> > -       zswap_pool_put(entry->pool);
> > +       zswap_pool_put(pool);
> >  freepage:
> >         zswap_entry_cache_free(entry);
> >  reject:
> >         obj_cgroup_put(objcg);
> >         if (zswap_pool_reached_full)
> >                 queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +
> > +       return false;
> > +}
> > +
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> > + * and stored sequentially.
> > + */
> > +bool zswap_store(struct folio *folio)
> > +{
> > +       long nr_pages = folio_nr_pages(folio);
> > +       swp_entry_t swp = folio->swap;
> > +       pgoff_t offset = swp_offset(swp);
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> > +
> > +       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +       /* Storing large folios isn't enabled */
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> > +               return false;
> > +
> > +       if (!zswap_enabled)
> > +               goto reject;
> > +
> >         /*
> > -        * If the zswap store fails or zswap is disabled, we must invalidate the
> > -        * possibly stale entry which was previously stored at this offset.
> > -        * Otherwise, writeback could overwrite the new data in the swapfile.
> > +        * Check cgroup limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> >          */
> > -       entry = xa_erase(tree, offset);
> > -       if (entry)
> > -               zswap_entry_free(entry);
> > -       return false;
> > +       objcg = get_obj_cgroup_from_folio(folio);
> > +       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (shrink_memcg(memcg)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_objcg;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       if (zswap_check_limits())
> > +               goto put_objcg;
> > +
> > +       pool = zswap_pool_current_get();
> > +       if (!pool)
> > +               goto put_objcg;
> > +
> > +       if (objcg) {
> > +               memcg = get_mem_cgroup_from_objcg(objcg);
> > +               if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > +                       mem_cgroup_put(memcg);
> > +                       goto put_pool;
> > +               }
> > +               mem_cgroup_put(memcg);
> > +       }
> > +
> > +       /*
> > +        * Store each page of the folio as a separate entry. If we fail to store
> > +        * a page, unwind by removing all the previous pages we stored.
> > +        */
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!zswap_store_page(folio, index, objcg, pool))
> > +                       goto put_pool;
> > +       }
> > +
> > +       ret = true;
> > +
> > +put_pool:
> > +       zswap_pool_put(pool);
> > +put_objcg:
> > +       obj_cgroup_put(objcg);
> > +       if (zswap_pool_reached_full)
> > +               queue_work(shrink_wq, &zswap_shrink_work);
> > +reject:
> > +       /*
> > +        * If the zswap store fails or zswap is disabled, we must invalidate
> > +        * the possibly stale entries which were previously stored at the
> > +        * offsets corresponding to each page of the folio. Otherwise,
> > +        * writeback could overwrite the new data in the swapfile.
> > +        */
> > +       if (!ret)
> > +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +       return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
  2024-08-29 21:27 [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-08-29 21:27 ` [PATCH v6 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
  2024-08-29 21:27 ` [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
@ 2024-08-29 21:27 ` Kanchana P Sridhar
  2024-08-30  0:19   ` Nhat Pham
  2024-09-20 22:57   ` Yosry Ahmed
  2024-08-29 22:48 ` [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
  2024-09-02 14:40 ` Usama Arif
  4 siblings, 2 replies; 34+ messages in thread
From: Kanchana P Sridhar @ 2024-08-29 21:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
per-order mTHP folio ZSWAP stores can be accounted.

If zswap_store() successfully swaps out an mTHP, it will be counted under
the per-order sysfs "zswpout" stats:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout

Other block dev/fs mTHP swap-out events will be counted under
the existing sysfs "swpout" stats:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
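
As a rough illustration of how these counters could be consumed from
userspace (a sketch only, not part of this patch), the helpers below build
the per-size stats path and read a single counter value. read_counter() and
mthp_stat_path() are hypothetical names, and the sketch assumes each stats
file holds one unsigned decimal number.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one unsigned decimal counter from a sysfs-style file.
 * Returns 0 and stores the value in *val on success, -1 on any error.
 */
static int read_counter(const char *path, unsigned long long *val)
{
	FILE *f;
	int ok;

	assert(path && val);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%llu", val) == 1;
	fclose(f);

	return ok ? 0 : -1;
}

/*
 * Build the per-size mTHP stats path, e.g. size_kb = 64 and
 * stat = "zswpout". Returns 0 if the path fit in buf, -1 otherwise.
 */
static int mthp_stat_path(char *buf, size_t len, unsigned int size_kb,
			  const char *stat)
{
	int n = snprintf(buf, len,
		"/sys/kernel/mm/transparent_hugepage/hugepages-%ukB/stats/%s",
		size_kb, stat);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

Reading both "zswpout" and "swpout" for the same size would then show how
many mTHP swap-outs of that order went to zswap versus the backing swap
device.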

Based on changes made in commit 61e751c01466ffef5dc72cb64349454a691c6bfe
("mm: cleanup count_mthp_stat() definition"), this patch also moves
the call to count_mthp_stat() in count_swpout_vm_event() to be outside
the "ifdef CONFIG_TRANSPARENT_HUGEPAGE".
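
The reason the call can move outside the ifdef is the usual kernel stub
pattern: when the feature is compiled out, the helper still exists as an
empty inline, so call sites need no guard. The userspace sketch below shows
the idea under that assumption; CONFIG_FEATURE_STATS, count_stat() and
account_swpout() are hypothetical stand-ins and not the kernel definitions.

```c
#include <assert.h>

/* Stand-in for the real config option; remove to build the stub variant. */
#define CONFIG_FEATURE_STATS 1

#define NR_ORDERS 10	/* hypothetical bound, for this sketch only */

enum stat_item { STAT_ZSWPOUT, STAT_SWPOUT, NR_STATS };

static unsigned long stats[NR_ORDERS][NR_STATS];

#ifdef CONFIG_FEATURE_STATS
/* Real variant: count only large orders that are within range. */
static inline void count_stat(int order, enum stat_item item)
{
	if (order <= 0 || order >= NR_ORDERS)
		return;
	stats[order][item]++;
}
#else
/* Stub variant: same signature, no effect. */
static inline void count_stat(int order, enum stat_item item)
{
	(void)order;
	(void)item;
}
#endif

/* The call site compiles in both configurations without an #ifdef. */
static void account_swpout(int order)
{
	count_stat(order, STAT_SWPOUT);
}
```

With the option defined, two calls to account_swpout(4) leave
stats[4][STAT_SWPOUT] at 2, while order-0 calls are ignored; with the option
removed, the same calls compile and do nothing.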

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/huge_mm.h | 1 +
 mm/huge_memory.c        | 3 +++
 mm/page_io.c            | 3 ++-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4da102b74a8c..8b690328e78b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -118,6 +118,7 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPOUT,
 	MTHP_STAT_SWPOUT_FALLBACK,
 	MTHP_STAT_SHMEM_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15418ffdd377..ad921c4b2ad8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -587,6 +587,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
 #ifdef CONFIG_SHMEM
@@ -605,6 +606,7 @@ static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_fallback_attr.attr,
 	&anon_fault_fallback_charge_attr.attr,
 #ifndef CONFIG_SHMEM
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 #endif
@@ -637,6 +639,7 @@ static struct attribute_group file_stats_attr_grp = {
 
 static struct attribute *any_stats_attrs[] = {
 #ifdef CONFIG_SHMEM
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 #endif
diff --git a/mm/page_io.c b/mm/page_io.c
index b6f1519d63b0..26106e745d73 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -289,6 +289,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		swap_zeromap_folio_clear(folio);
 	}
 	if (zswap_store(folio)) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		folio_unlock(folio);
 		return 0;
 	}
@@ -308,8 +309,8 @@ static inline void count_swpout_vm_event(struct folio *folio)
 		count_memcg_folio_events(folio, THP_SWPOUT, 1);
 		count_vm_event(THP_SWPOUT);
 	}
-	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
 #endif
+	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
 	count_vm_events(PSWPOUT, folio_nr_pages(folio));
 }
 
-- 
2.27.0




* Re: [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
  2024-08-29 21:27 ` [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
@ 2024-08-30  0:19   ` Nhat Pham
  2024-09-20  2:32     ` Sridhar, Kanchana P
  2024-09-20 22:57   ` Yosry Ahmed
  1 sibling, 1 reply; 34+ messages in thread
From: Nhat Pham @ 2024-08-30  0:19 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou,
	wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
> per-order mTHP folio ZSWAP stores can be accounted.

Can you update Documentation/admin-guide/mm/transhuge.rst?

1. New entry for zswpout.

2. Probably should clarify the semantics of swpout too - this does not
include zswap right?



* RE: [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
  2024-08-30  0:19   ` Nhat Pham
@ 2024-09-20  2:32     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:32 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosryahmed@google.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, Huang, Ying, 21cnbao@gmail.com,
	akpm@linux-foundation.org, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, August 29, 2024 5:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores
> in sysfs mTHP zswpout stats.
> 
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
> > per-order mTHP folio ZSWAP stores can be accounted.
> 
> Can you update Documentation/admin-guide/mm/transhuge.rst?

Certainly, will do so!

> 
> 1. New entry for zswpout.
> 
> 2. Probably should clarify the semantics of swpout too - this does not
> include zswap right?

Sure. And yes, this does not include zswap.
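
To check the split between the two counters from userspace, something like the following C sketch could be used. It assumes the sysfs layout quoted in the patch description. The helper names mthp_stat_path() and read_counter() are made up for illustration and are not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>

/* Base directory as given in the patch description. */
#define MTHP_SYSFS_BASE "/sys/kernel/mm/transparent_hugepage"

/*
 * Build the path of a per-size mTHP stat file, e.g.
 * .../hugepages-64kB/stats/zswpout.
 * Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper, for illustration only.)
 */
static int mthp_stat_path(char *buf, size_t len, unsigned int size_kb,
                          const char *stat)
{
    int n;

    assert(buf && stat);
    n = snprintf(buf, len, "%s/hugepages-%ukB/stats/%s",
                 MTHP_SYSFS_BASE, size_kb, stat);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read a single unsigned counter from a file.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
static int read_counter(const char *path, unsigned long long *val)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", val) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Reading "zswpout" and "swpout" for the same size before and after a test run would show which path the mTHP folios took. On a kernel without this patch the zswpout file is absent and read_counter() returns -1.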

Thanks,
Kanchana


* Re: [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
  2024-08-29 21:27 ` [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
  2024-08-30  0:19   ` Nhat Pham
@ 2024-09-20 22:57   ` Yosry Ahmed
  2024-09-20 23:28     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 34+ messages in thread
From: Yosry Ahmed @ 2024-09-20 22:57 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou,
	wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
> per-order mTHP folio ZSWAP stores can be accounted.
>
> If zswap_store() successfully swaps out an mTHP, it will be counted under
> the per-order sysfs "zswpout" stats:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> Other block dev/fs mTHP swap-out events will be counted under
> the existing sysfs "swpout" stats:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
>
> Based on changes made in commit 61e751c01466ffef5dc72cb64349454a691c6bfe
> ("mm: cleanup count_mthp_stat() definition"), this patch also moves
> the call to count_mthp_stat() in count_swpout_vm_event() to be outside
> the "ifdef CONFIG_TRANSPARENT_HUGEPAGE".

This should be in a separate change, it's irrelevant to
MTHP_STAT_ZSWPOUT being added.
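
For readers following along, the pattern behind that hunk can be sketched in plain userspace C. This is a model under stated assumptions, not the kernel code: MODEL_THP stands in for CONFIG_TRANSPARENT_HUGEPAGE, and every model_* name is hypothetical. The point is only that once the counting helper has a definition in both configurations, the caller no longer needs to wrap the call in an #ifdef.

```c
/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
#define MODEL_MAX_ORDER 10

enum model_stat { MODEL_STAT_ZSWPOUT, MODEL_STAT_SWPOUT, MODEL_STAT_NR };

#ifdef MODEL_THP
static unsigned long model_stats[MODEL_MAX_ORDER + 1][MODEL_STAT_NR];

/* Real implementation: bump the per-order counter. */
static inline void model_count_mthp_stat(int order, enum model_stat item)
{
    if (order < 1 || order > MODEL_MAX_ORDER)
        return;
    model_stats[order][item]++;
}
#else
/* Stub: compiles to nothing, so callers need no #ifdef of their own. */
static inline void model_count_mthp_stat(int order, enum model_stat item)
{
    (void)order;
    (void)item;
}
#endif

static unsigned long model_pswpout;

static void model_count_swpout(int order)
{
#ifdef MODEL_THP
    /* THP-only event counting would stay inside this block. */
#endif
    /* Safe outside the #ifdef because a definition always exists. */
    model_count_mthp_stat(order, MODEL_STAT_SWPOUT);
    model_pswpout += 1UL << order;
}
```

Built without -DMODEL_THP, the per-order call is a no-op while the page-granular counter still advances, which is the behaviour the moved line relies on.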

>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  include/linux/huge_mm.h | 1 +
>  mm/huge_memory.c        | 3 +++
>  mm/page_io.c            | 3 ++-
>  3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4da102b74a8c..8b690328e78b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -118,6 +118,7 @@ enum mthp_stat_item {
>         MTHP_STAT_ANON_FAULT_ALLOC,
>         MTHP_STAT_ANON_FAULT_FALLBACK,
>         MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> +       MTHP_STAT_ZSWPOUT,
>         MTHP_STAT_SWPOUT,
>         MTHP_STAT_SWPOUT_FALLBACK,
>         MTHP_STAT_SHMEM_ALLOC,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 15418ffdd377..ad921c4b2ad8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -587,6 +587,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>  DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
>  DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
>  DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> +DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
>  DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
>  DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
>  #ifdef CONFIG_SHMEM
> @@ -605,6 +606,7 @@ static struct attribute *anon_stats_attrs[] = {
>         &anon_fault_fallback_attr.attr,
>         &anon_fault_fallback_charge_attr.attr,
>  #ifndef CONFIG_SHMEM
> +       &zswpout_attr.attr,
>         &swpout_attr.attr,
>         &swpout_fallback_attr.attr,
>  #endif
> @@ -637,6 +639,7 @@ static struct attribute_group file_stats_attr_grp = {
>
>  static struct attribute *any_stats_attrs[] = {
>  #ifdef CONFIG_SHMEM
> +       &zswpout_attr.attr,
>         &swpout_attr.attr,
>         &swpout_fallback_attr.attr,
>  #endif
> diff --git a/mm/page_io.c b/mm/page_io.c
> index b6f1519d63b0..26106e745d73 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -289,6 +289,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 swap_zeromap_folio_clear(folio);
>         }
>         if (zswap_store(folio)) {
> +               count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
>                 folio_unlock(folio);
>                 return 0;
>         }
> @@ -308,8 +309,8 @@ static inline void count_swpout_vm_event(struct folio *folio)
>                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
>                 count_vm_event(THP_SWPOUT);
>         }
> -       count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
>  #endif
> +       count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
>         count_vm_events(PSWPOUT, folio_nr_pages(folio));
>  }
>
> --
> 2.27.0
>



* RE: [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
  2024-09-20 22:57   ` Yosry Ahmed
@ 2024-09-20 23:28     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 23:28 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 20, 2024 3:58 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores
> in sysfs mTHP zswpout stats.
> 
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
> > per-order mTHP folio ZSWAP stores can be accounted.
> >
> > If zswap_store() successfully swaps out an mTHP, it will be counted under
> > the per-order sysfs "zswpout" stats:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > Other block dev/fs mTHP swap-out events will be counted under
> > the existing sysfs "swpout" stats:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
> >
> > Based on changes made in commit 61e751c01466ffef5dc72cb64349454a691c6bfe
> > ("mm: cleanup count_mthp_stat() definition"), this patch also moves
> > the call to count_mthp_stat() in count_swpout_vm_event() to be outside
> > the "ifdef CONFIG_TRANSPARENT_HUGEPAGE".
> 
> This should be in a separate change, it's irrelevant to
> MTHP_STAT_ZSWPOUT being added.

Sure. I will submit this as a separate change.

Thanks,
Kanchana

> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  include/linux/huge_mm.h | 1 +
> >  mm/huge_memory.c        | 3 +++
> >  mm/page_io.c            | 3 ++-
> >  3 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 4da102b74a8c..8b690328e78b 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -118,6 +118,7 @@ enum mthp_stat_item {
> >         MTHP_STAT_ANON_FAULT_ALLOC,
> >         MTHP_STAT_ANON_FAULT_FALLBACK,
> >         MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
> > +       MTHP_STAT_ZSWPOUT,
> >         MTHP_STAT_SWPOUT,
> >         MTHP_STAT_SWPOUT_FALLBACK,
> >         MTHP_STAT_SHMEM_ALLOC,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 15418ffdd377..ad921c4b2ad8 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -587,6 +587,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> >  DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
> >  DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
> >  DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> > +DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
> >  DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
> >  DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
> >  #ifdef CONFIG_SHMEM
> > @@ -605,6 +606,7 @@ static struct attribute *anon_stats_attrs[] = {
> >         &anon_fault_fallback_attr.attr,
> >         &anon_fault_fallback_charge_attr.attr,
> >  #ifndef CONFIG_SHMEM
> > +       &zswpout_attr.attr,
> >         &swpout_attr.attr,
> >         &swpout_fallback_attr.attr,
> >  #endif
> > @@ -637,6 +639,7 @@ static struct attribute_group file_stats_attr_grp = {
> >
> >  static struct attribute *any_stats_attrs[] = {
> >  #ifdef CONFIG_SHMEM
> > +       &zswpout_attr.attr,
> >         &swpout_attr.attr,
> >         &swpout_fallback_attr.attr,
> >  #endif
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index b6f1519d63b0..26106e745d73 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -289,6 +289,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >                 swap_zeromap_folio_clear(folio);
> >         }
> >         if (zswap_store(folio)) {
> > +               count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
> >                 folio_unlock(folio);
> >                 return 0;
> >         }
> > @@ -308,8 +309,8 @@ static inline void count_swpout_vm_event(struct folio *folio)
> >                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
> >                 count_vm_event(THP_SWPOUT);
> >         }
> > -       count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
> >  #endif
> > +       count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
> >         count_vm_events(PSWPOUT, folio_nr_pages(folio));
> >  }
> >
> > --
> > 2.27.0
> >


* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 21:27 [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-08-29 21:27 ` [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
@ 2024-08-29 22:48 ` Yosry Ahmed
  2024-08-29 23:45   ` Nhat Pham
                     ` (2 more replies)
  2024-09-02 14:40 ` Usama Arif
  4 siblings, 3 replies; 34+ messages in thread
From: Yosry Ahmed @ 2024-08-29 22:48 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou,
	wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/3 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. When disabled, zswap will
> fall back to rejecting the mTHP folio, to be processed by the backing
> swap device.
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
> reviews and suggestions!
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 8/29/2024,
>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>    suggestion to add a knob by which users can enable/disable this
>    change. Nhat, I hope this is along the lines of what you were
>    thinking.
> 3) Added vm-scalability usemem data with 4K folios with
>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>    there is no regression with this change.
> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>    before/after, as suggested by Yosry, so we can understand the impact
>    of when mTHPs are split into 4K folios in shrink_folio_list()
>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>    in zswap. Thanks Yosry for this suggestion.
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>    robot; as per Nhat's and Michal's suggestion to not require a separate
>    patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters and
>    invoke count_mthp_stat() after successful zswap_store()s into a single
>    commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
>
> Regression Testing:
> ===================
> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> folios with mm-unstable and with this patch-series. The main goal was
> to make sure that there is no functional or performance regression
> wrt the earlier zswap behavior for 4K folios when
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> pages goes through the newly added code path [zswap_store(),
> zswap_store_page()].
>
> The data indicates there is no regression.
>
>  ------------------------------------------------------------------------------
>                      mm-unstable 8-28-2024                        zswap-mTHP v6
>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>                                                                      is not set
>  ------------------------------------------------------------------------------
>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
>                                        iaa                                  iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    110,775      113,010               111,550        121,937
>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
>  memcg_high           140,500      153,737               139,772        134,129
>  memcg_swap_high            0            0                     0              0
>  memcg_swap_fail            0            0                     0              0
>  pswpin                     0            0                     0              0
>  pswpout                    0            0                     0              0
>  zswpin                   675          690                   682            684
>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
>  thp_swpout                 0            0                     0              0
>  thp_swpout_                0            0                     0              0
>   fallback
>  pgmajfault             3,453        3,468                 3,841          3,487
>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
>  SWPOUT-64kB-mTHP           0            0                     0              0
>  ------------------------------------------------------------------------------
>
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressors : zstd, deflate-iaa
>     ZSWAP Allocator   : zsmalloc
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
>
> Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
> ===============================================================================
>
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results in
> 64K/2M (m)THP being split, and only 4K folios being processed by zswap.
>
> The "after" is CONFIG_THP_SWAP set to on, with this patch-series, which results
> in 64K/2M (m)THP not being split, and being processed by zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
>                                  Baseline                               Baseline
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
>  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
>  memcg_high          124,183      127,513     138,651     133,884
>  memcg_swap_high           0            0           0           0
>  memcg_swap_fail     619,020      751,099           0           0
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  656          569         624         639
>  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  pgmajfault            3,470        3,382       4,633       3,611
>  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
>  SWPOUT-64kB               0            0           0           0
>  -------------------------------------------------------------------------------
>
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  ------------------------------------------------------------------------------
>                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
>                                  Baseline                              Baseline
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>                                      iaa                     iaa            iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
>  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
>  memcg_high            14,628     16,247       14,951      16,096
>  memcg_swap_high            0          0            0           0
>  memcg_swap_fail       18,698     21,114            0           0
>  pswpin                     0          0            0           0
>  pswpout                    0          0            0           0
>  zswpin                   663        665        5,333         781
>  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
>  thp_swpout                 0          0            0           0
>  thp_swpout_           18,697     21,113            0           0
>   fallback
>  pgmajfault             3,439      3,496        8,139       3,582
>  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
>  SWPOUT-2048kB              0          0            0           0
>  ------------------------------------------------------------------------------
>
> We see improvements overall in throughput and sys time for zstd and
> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
>
>
> Case 2: Baseline with CONFIG_THP_SWAP enabled.
> ==============================================
>
> In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
> being stored by the backing swap device.
>
> The "after" represents data with this patch-series, that results in 64K/2M
> (m)THP being processed by zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  ------------------------------------------------------------------------------
>                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
>                                Baseline                                Baseline
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
>                                     iaa                     iaa             iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
>  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
>  memcg_high          111,223    110,889     138,651     133,884
>  memcg_swap_high           0          0           0           0
>  memcg_swap_fail           0          0           0           0
>  pswpin                   16         16           0           0
>  pswpout           7,471,472  7,527,963           0           0
>  zswpin                  635        605         624         639
>  zswpout               1,509      1,478   9,453,761   9,385,910
>  thp_swpout                0          0           0           0
>  thp_swpout_               0          0           0           0
>   fallback
>  pgmajfault            3,616      3,430       4,633       3,611
>  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
>  SWPOUT-64kB         466,967    470,498           0           0
>  ------------------------------------------------------------------------------
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  ------------------------------------------------------------------------------
>                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
>                                 Baseline                               Baseline
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>                                      iaa                     iaa            iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
>  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
>  memcg_high            16,054     15,936      14,951      16,096
>  memcg_swap_high            0          0           0           0
>  memcg_swap_fail            0          0           0           0
>  pswpin                     0          0           0           0
>  pswpout            8,629,248  8,628,907           0           0
>  zswpin                   560        645       5,333         781
>  zswpout                1,416      1,503   8,546,895   9,355,760
>  thp_swpout            16,854     16,853           0           0
>  thp_swpout_                0          0           0           0
>   fallback
>  pgmajfault             3,341      3,574       8,139       3,582
>  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
>  SWPOUT-2048kB         16,854     16,853           0           0
>  ------------------------------------------------------------------------------
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both the allocations and
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> This causes degradation in throughput and sys time with zswap mTHP, more so
> in case of zstd than deflate-iaa. Compress latency could play a part in
> this - when there is more swapout activity happening, a slower compressor
> would cause allocations to stall for any/all of the 70 processes.

We are basically comparing zram with zswap in this case, and it's not
fair because, as you mentioned, the zswap compressed data is being
accounted for while the zram compressed data isn't. I am not really
sure how valuable these test results are. Even if we remove the cgroup
accounting from zswap, we won't see an improvement; we should expect
performance similar to zram.
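
To make the accounting difference concrete, here is a toy steady-state
model in Python. It is only a sketch: the function name, the 70 GiB working
set, the 40 GiB limit and the 2.0 compression ratio are illustrative
assumptions, not measurements from this test, and it ignores reclaim
dynamics and throttling entirely.

```python
def swapped_out_gib(working_set_gib, limit_gib, compression_ratio, pool_is_charged):
    """Uncompressed GiB that must be swapped out for the cgroup to fit its limit.

    Toy model (assumption): the whole working set is live at once, and the
    compressed copy of S GiB occupies S / compression_ratio GiB.
    """
    if compression_ratio <= 1.0:
        raise ValueError("compression_ratio must be greater than 1")
    if working_set_gib <= limit_gib:
        return 0.0
    if not pool_is_charged:
        # zram-like case: compressed copies are not charged to the cgroup,
        # so only the resident (uncompressed) pages count towards the limit.
        return float(working_set_gib - limit_gib)
    # zswap-like case: resident pages plus the compressed pool are charged:
    #   (W - S) + S / r <= L   =>   S >= (W - L) / (1 - 1 / r)
    swapped = (working_set_gib - limit_gib) / (1.0 - 1.0 / compression_ratio)
    return min(swapped, float(working_set_gib))


if __name__ == "__main__":
    # Hypothetical inputs: 70 GiB working set, 40 GiB limit, 2:1 compression.
    print(swapped_out_gib(70, 40, 2.0, pool_is_charged=False))
    print(swapped_out_gib(70, 40, 2.0, pool_is_charged=True))
```

With these assumed inputs the unaccounted case needs 30 GiB swapped out and
the accounted case needs 60 GiB, which is the kind of extra swapout
activity the "After" data shows.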

I think the test results that are really valuable are case 1, where
zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
it after this series.

If we really want to compare CONFIG_THP_SWAP on before and after, it
should be with SSD because that's a more conventional setup. In this
case the users that have CONFIG_THP_SWAP=y only experience the
benefits of zswap with this series. You mentioned experimenting with
usemem to keep the memory allocated longer so that you're able to have
a fair test with the small SSD swap setup. Did that work?

I am hoping Nhat or Johannes would shed some light on whether they
usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
Otherwise the testing results from case 1 should be sufficient.

>
> In my opinion, even though the test set up does not provide an accurate
> way for a direct before/after comparison (because of zswap usage being
> counted in cgroup, hence towards the memory.high), it still seems
> reasonable for zswap_store to support (m)THP, so that further performance
> improvements can be implemented.

This is only referring to the results of case 2, right?

Honestly, I wouldn't want to merge mTHP swapout support on its own
just because it enables further performance improvements without
having actual patches for them. But I don't think this captures the
results accurately as it dismisses case 1 results (which I think are
more reasonable).

Thanks



* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 22:48 ` [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
@ 2024-08-29 23:45   ` Nhat Pham
  2024-08-29 23:54     ` Yosry Ahmed
  2024-09-20  2:16     ` Sridhar, Kanchana P
  2024-08-30  9:27   ` Huang, Ying
  2024-09-20  1:41   ` Sridhar, Kanchana P
  2 siblings, 2 replies; 34+ messages in thread
From: Nhat Pham @ 2024-08-29 23:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 3:49 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
>
> We are basically comparing zram with zswap in this case, and it's not
> fair because, as you mentioned, the zswap compressed data is being
> accounted for while the zram compressed data isn't. I am not really
> sure how valuable these test results are. Even if we remove the cgroup
> accounting from zswap, we won't see an improvement, we should expect a
> similar performance to zram.
>
> I think the test results that are really valuable are case 1, where
> zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> it after this series.

Ah, this is a good point.

I think the point of comparing mTHP zswap vs. mTHP (SSD) swap is more
of a sanity check. IOW, if mTHP swap outperforms mTHP zswap, then
something is wrong (otherwise why would we enable zswap - might as well
just use swap, since SSD swap with mTHP >>> zswap with mTHP >>> zswap
without mTHP).

That said, I don't think this benchmark can show it anyway. The access
pattern here is such that all the allocated memory is really cold,
so swap to disk (or to zram, which does not account memory usage
towards cgroup) is better by definition... And Kanchana does not seem
to have access to a setup with larger SSD swapfiles? :)

>
> If we really want to compare CONFIG_THP_SWAP on before and after, it
> should be with SSD because that's a more conventional setup. In this
> case the users that have CONFIG_THP_SWAP=y only experience the
> benefits of zswap with this series. You mentioned experimenting with
> usemem to keep the memory allocated longer so that you're able to have
> a fair test with the small SSD swap setup. Did that work?
>
> I am hoping Nhat or Johannes would shed some light on whether they
> usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> Otherwise the testing results from case 1 should be sufficient.
>
> >
> > In my opinion, even though the test set up does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
>
> This is only referring to the results of case 2, right?
>
> Honestly, I wouldn't want to merge mTHP swapout support on its own
> just because it enables further performance improvements without
> having actual patches for them. But I don't think this captures the
> results accurately as it dismisses case 1 results (which I think are
> more reasonable).
>
> Thanks



* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 23:45   ` Nhat Pham
@ 2024-08-29 23:54     ` Yosry Ahmed
  2024-08-30  0:06       ` Nhat Pham
  2024-09-20  2:22       ` Sridhar, Kanchana P
  2024-09-20  2:16     ` Sridhar, Kanchana P
  1 sibling, 2 replies; 34+ messages in thread
From: Yosry Ahmed @ 2024-08-29 23:54 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 4:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Aug 29, 2024 at 3:49 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> >
> > We are basically comparing zram with zswap in this case, and it's not
> > fair because, as you mentioned, the zswap compressed data is being
> > accounted for while the zram compressed data isn't. I am not really
> > sure how valuable these test results are. Even if we remove the cgroup
> > accounting from zswap, we won't see an improvement, we should expect a
> > similar performance to zram.
> >
> > I think the test results that are really valuable are case 1, where
> > zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> > it after this series.
>
> Ah, this is a good point.
>
> I think the point of comparing mTHP zswap v.s mTHP (SSD)swap is more
> of a sanity check. IOW, if mTHP swap outperforms mTHP zswap, then
> something is wrong (otherwise why would enable zswap - might as well
> just use swap, since SSD swap with mTHP >>> zswap with mTHP >>> zswap
> without mTHP).

Yeah, good point, but as you mention below..

>
> That said, I don't think this benchmark can show it anyway. The access
> pattern here is such that all the allocated memories are really cold,
> so swap to disk (or to zram, which does not account memory usage
> towards cgroup) is better by definition... And Kanchana does not seem
> to have access to setup with larger SSD swapfiles? :)

I think it's also the fact that the processes exit right after they
are done allocating the memory. So I think in the case of SSD, when we
stall waiting for IO some processes get to exit and free up memory, so
we need to do less swapping out in general because the processes are
more serialized. With zswap, all processes try to access memory at the
same time so the required amount of memory at any given point is
higher, leading to more thrashing.

I suggested keeping the memory allocated for a long time to even the
playing field, or we can make the processes keep looping and accessing
the memory (or part of it) for a while.
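
As a rough illustration of what I mean, here is a Python sketch of a worker
that allocates a buffer, touches every page, and then keeps re-touching it
for a fixed time instead of exiting right away. The function names, the
4 KiB page size and the timing parameters are assumptions for illustration;
this is not how usemem is implemented.

```python
import time

PAGE_SIZE = 4096  # assumed base page size


def touch_pages(buf, page_size=PAGE_SIZE):
    """Write one byte in each page of buf so every page is faulted in."""
    touched = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = (buf[offset] + 1) & 0xFF
        touched += 1
    return touched


def hold_and_touch(size_bytes, hold_seconds, interval_seconds=1.0):
    """Allocate size_bytes, touch it once, then re-touch it until the
    deadline passes. Returns the number of passes over the buffer."""
    buf = bytearray(size_bytes)
    touch_pages(buf)
    passes = 1
    deadline = time.monotonic() + hold_seconds
    while time.monotonic() < deadline:
        remaining = max(0.0, deadline - time.monotonic())
        time.sleep(min(interval_seconds, remaining))
        touch_pages(buf)  # keep the memory in use instead of exiting
        passes += 1
    return passes


if __name__ == "__main__":
    # Hypothetical sizes: 64 MiB held for 3 seconds.
    print(hold_and_touch(64 * 1024 * 1024, hold_seconds=3.0))
```

Running many of these concurrently keeps the total memory demand high for
the whole run in both the SSD and the zswap case, so the SSD case no longer
benefits from processes exiting while others are stalled on IO.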

That being said, I think this may be a signal that the memory.high
throttling is not performing as expected in the zswap case. Not sure
tbh, but I don't think SSD swap should perform better than zswap in
that case.



* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 23:54     ` Yosry Ahmed
@ 2024-08-30  0:06       ` Nhat Pham
  2024-08-30  0:14         ` Yosry Ahmed
  2024-09-20  2:26         ` Sridhar, Kanchana P
  2024-09-20  2:22       ` Sridhar, Kanchana P
  1 sibling, 2 replies; 34+ messages in thread
From: Nhat Pham @ 2024-08-30  0:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 4:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Aug 29, 2024 at 4:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
> I think it's also the fact that the processes exit right after they
> are done allocating the memory. So I think in the case of SSD, when we
> stall waiting for IO some processes get to exit and free up memory, so
> we need to do less swapping out in general because the processes are
> more serialized. With zswap, all processes try to access memory at the
> same time so the required amount of memory at any given point is
> higher, leading to more thrashing.
>
> I suggested keeping the memory allocated for a long time to even the
> playing field, or we can make the processes keep looping and accessing
> the memory (or part of it) for a while.
>
> That being said, I think this may be a signal that the memory.high
> throttling is not performing as expected in the zswap case. Not sure
> tbh, but I don't think SSD swap should perform better than zswap in
> that case.

Yeah something is fishy there. That said, the benchmarking in v4 is wack:

1. We use lz4, which has a really poor compression factor.

2. The swapfile is really small, so we occasionally see problems with
swap allocation failure.

Both of these factors affect benchmarking validity and stability a
lot. I think in this version's benchmarks, with zstd as the software
compressor + a much larger swapfile (albeit on top of a ZRAM block
device), we no longer see memory.high violations, even at a lower
memory.high value...? The performance numbers are wack indeed - not a
lot of value in the case 2 section.



* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-30  0:06       ` Nhat Pham
@ 2024-08-30  0:14         ` Yosry Ahmed
  2024-09-20  2:30           ` Sridhar, Kanchana P
  2024-09-20  2:26         ` Sridhar, Kanchana P
  1 sibling, 1 reply; 34+ messages in thread
From: Yosry Ahmed @ 2024-08-30  0:14 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Thu, Aug 29, 2024 at 5:06 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Aug 29, 2024 at 4:55 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 4:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > I think it's also the fact that the processes exit right after they
> > are done allocating the memory. So I think in the case of SSD, when we
> > stall waiting for IO some processes get to exit and free up memory, so
> > we need to do less swapping out in general because the processes are
> > more serialized. With zswap, all processes try to access memory at the
> > same time so the required amount of memory at any given point is
> > higher, leading to more thrashing.
> >
> > I suggested keeping the memory allocated for a long time to even the
> > playing field, or we can make the processes keep looping and accessing
> > the memory (or part of it) for a while.
> >
> > That being said, I think this may be a signal that the memory.high
> > throttling is not performing as expected in the zswap case. Not sure
> > tbh, but I don't think SSD swap should perform better than zswap in
> > that case.
>
> Yeah something is fishy there. That said, the benchmarking in v4 is wack:
>
> 1. We use lz4, which has a really poor compression factor.
>
> 2. The swapfile is really small, so we occasionally see problems with
> swap allocation failure.
>
> Both of these factors affect benchmarking validity and stability a
> lot. I think in this version's benchmarks, with zstd as the software
> compressor + a much larger swapfile (albeit on top of a ZRAM block
> device), we no longer see memory.high violation, even at a lower
> memory.high value...? The performance number is wack indeed - not a
> lot of values in the case 2 section.

But when we use zram we are essentially comparing two swap mechanisms
compressing mTHPs page by page, with the only difference being that
zram does not account the memory. For this to have any value imo it
should be on an SSD to at least provide the value of being a practical
sanity check as you mentioned earlier. In its current form I don't
think it's providing any value.



* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-30  0:14         ` Yosry Ahmed
@ 2024-09-20  2:30           ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:30 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, August 29, 2024 5:14 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Thu, Aug 29, 2024 at 5:06 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 4:55 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> > >
> > > On Thu, Aug 29, 2024 at 4:45 PM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > > I think it's also the fact that the processes exit right after they
> > > are done allocating the memory. So I think in the case of SSD, when we
> > > stall waiting for IO some processes get to exit and free up memory, so
> > > we need to do less swapping out in general because the processes are
> > > more serialized. With zswap, all processes try to access memory at the
> > > same time so the required amount of memory at any given point is
> > > higher, leading to more thrashing.
> > >
> > > I suggested keeping the memory allocated for a long time to even the
> > > playing field, or we can make the processes keep looping and accessing
> > > the memory (or part of it) for a while.
> > >
> > > That being said, I think this may be a signal that the memory.high
> > > throttling is not performing as expected in the zswap case. Not sure
> > > tbh, but I don't think SSD swap should perform better than zswap in
> > > that case.
> >
> > Yeah something is fishy there. That said, the benchmarking in v4 is wack:
> >
> > 1. We use lz4, which has a really poor compression factor.
> >
> > 2. The swapfile is really small, so we occasionally see problems with
> > swap allocation failure.
> >
> > Both of these factors affect benchmarking validity and stability a
> > lot. I think in this version's benchmarks, with zstd as the software
> > compressor + a much larger swapfile (albeit on top of a ZRAM block
> > device), we no longer see memory.high violation, even at a lower
> > memory.high value...? The performance number is wack indeed - not a
> > lot of values in the case 2 section.
> 
> But when we use zram we are essentially comparing two swap mechanisms
> compressing mTHPs page by page, with the only difference being that
> zram does not account the memory. For this to have any value imo it
> should be on an SSD to at least provide the value of being a practical
> sanity check as you mentioned earlier. In its current form I don't
> think it's providing any value.

I just posted data today with SSD and longer-running usemem processes,
which should hopefully better quantify the benefit of zswap-mTHP.

Thanks,
Kanchana


* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-30  0:06       ` Nhat Pham
  2024-08-30  0:14         ` Yosry Ahmed
@ 2024-09-20  2:26         ` Sridhar, Kanchana P
  1 sibling, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:26 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, August 29, 2024 5:07 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Thu, Aug 29, 2024 at 4:55 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> > On Thu, Aug 29, 2024 at 4:45 PM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > I think it's also the fact that the processes exit right after they
> > are done allocating the memory. So I think in the case of SSD, when we
> > stall waiting for IO some processes get to exit and free up memory, so
> > we need to do less swapping out in general because the processes are
> > more serialized. With zswap, all processes try to access memory at the
> > same time so the required amount of memory at any given point is
> > higher, leading to more thrashing.
> >
> > I suggested keeping the memory allocated for a long time to even the
> > playing field, or we can make the processes keep looping and accessing
> > the memory (or part of it) for a while.
> >
> > That being said, I think this may be a signal that the memory.high
> > throttling is not performing as expected in the zswap case. Not sure
> > tbh, but I don't think SSD swap should perform better than zswap in
> > that case.
> 
> Yeah something is fishy there. That said, the benchmarking in v4 is wack:
> 
> 1. We use lz4, which has a really poor compression factor.
> 
> 2. The swapfile is really small, so we occasionally see problems with
> swap allocation failure.
> 
> Both of these factors affect benchmarking validity and stability a
> lot. I think in this version's benchmarks, with zstd as the software
> compressor + a much larger swapfile (albeit on top of a ZRAM block
> device), we no longer see memory.high violation, even at a lower
> memory.high value...? The performance number is wack indeed - not a
> lot of values in the case 2 section.

Hopefully the latest data from the two sets of experiments (4G SSD with
usemem --sleep 10, and 179G SSD) makes better sense?

Thanks,
Kanchana


* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 23:54     ` Yosry Ahmed
  2024-08-30  0:06       ` Nhat Pham
@ 2024-09-20  2:22       ` Sridhar, Kanchana P
  1 sibling, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:22 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, August 29, 2024 4:55 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Thu, Aug 29, 2024 at 4:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 3:49 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> > >
> > > On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> > >
> > > We are basically comparing zram with zswap in this case, and it's not
> > > fair because, as you mentioned, the zswap compressed data is being
> > > accounted for while the zram compressed data isn't. I am not really
> > > sure how valuable these test results are. Even if we remove the cgroup
> > > accounting from zswap, we won't see an improvement, we should expect
> a
> > > similar performance to zram.
> > >
> > > I think the test results that are really valuable are case 1, where
> > > zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> > > it after this series.
> >
> > Ah, this is a good point.
> >
> > I think the point of comparing mTHP zswap v.s mTHP (SSD)swap is more
> > of a sanity check. IOW, if mTHP swap outperforms mTHP zswap, then
> > something is wrong (otherwise why would enable zswap - might as well
> > just use swap, since SSD swap with mTHP >>> zswap with mTHP >>> zswap
> > without mTHP).
> 
> Yeah, good point, but as you mention below..
> 
> >
> > That said, I don't think this benchmark can show it anyway. The access
> > pattern here is such that all the allocated memories are really cold,
> > so swap to disk (or to zram, which does not account memory usage
> > towards cgroup) is better by definition... And Kanchana does not seem
> > to have access to setup with larger SSD swapfiles? :)
> 
> I think it's also the fact that the processes exit right after they
> are done allocating the memory. So I think in the case of SSD, when we
> stall waiting for IO some processes get to exit and free up memory, so
> we need to do less swapping out in general because the processes are
> more serialized. With zswap, all processes try to access memory at the
> same time so the required amount of memory at any given point is
> higher, leading to more thrashing.
> 
> I suggested keeping the memory allocated for a long time to even the
> playing field, or we can make the processes keep looping and accessing
> the memory (or part of it) for a while.

Thanks for the suggestion, Yosry. I have shared the data in my earlier
response today, which seems to confirm your hypothesis. Please do let
me know if you have any other suggestions.

We generally see better usemem throughput with zswap-mTHP than with
SSD-mTHP.

Thanks,
Kanchana

> 
> That being said, I think this may be a signal that the memory.high
> throttling is not performing as expected in the zswap case. Not sure
> tbh, but I don't think SSD swap should perform better than zswap in
> that case.


* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 23:45   ` Nhat Pham
  2024-08-29 23:54     ` Yosry Ahmed
@ 2024-09-20  2:16     ` Sridhar, Kanchana P
  2024-09-20  9:12       ` Huang, Ying
  1 sibling, 1 reply; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:16 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, August 29, 2024 4:46 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Thu, Aug 29, 2024 at 3:49 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> > On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> >
> > We are basically comparing zram with zswap in this case, and it's not
> > fair because, as you mentioned, the zswap compressed data is being
> > accounted for while the zram compressed data isn't. I am not really
> > sure how valuable these test results are. Even if we remove the cgroup
> > accounting from zswap, we won't see an improvement, we should expect a
> > similar performance to zram.
> >
> > I think the test results that are really valuable are case 1, where
> > zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> > it after this series.
> 
> Ah, this is a good point.
> 
> I think the point of comparing mTHP zswap v.s mTHP (SSD)swap is more
> of a sanity check. IOW, if mTHP swap outperforms mTHP zswap, then
> something is wrong (otherwise why would enable zswap - might as well
> just use swap, since SSD swap with mTHP >>> zswap with mTHP >>> zswap
> without mTHP).
> 
> That said, I don't think this benchmark can show it anyway. The access
> pattern here is such that all the allocated memories are really cold,
> so swap to disk (or to zram, which does not account memory usage
> towards cgroup) is better by definition... And Kanchana does not seem
> to have access to setup with larger SSD swapfiles? :)

As a follow-up, I created a swapfile on disk to increase the SSD swap to 179G.

 64KB mTHP (cgroup memory.high set to 40G, no swap limit):
 =========================================================
 CONFIG_THP_SWAP=y
 Sapphire Rapids server with 503 GiB RAM and 179G SSD swap backing device
 for zswap.

 usemem --init-time -w -O --sleep 0 -n 70 1g:

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"      (sleep 0)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    93,273       88,496     143,117     134,131    53%     52%
 sys time (sec)       316.68       349.00      917.88      877.74  -190%   -152%
 memcg_high           73,836       83,522     126,120     133,013
 memcg_swap_fail     261,136      324,533     494,191     578,824
 pswpin                   16           11           0           0
 pswpout           1,242,187    1,263,493           0           0
 zswpin                  694          668         712         702
 zswpout           3,991,403    4,933,901   9,289,092  10,461,948
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,488        3,353       3,377       3,499
 ZSWPOUT-64kB            n/a          n/a     110,067     103,957
 SWPOUT-64kB          77,637       78,968           0           0
 -------------------------------------------------------------------------------

We do see a roughly 50% throughput improvement (53% with zstd, 52% with
deflate-iaa) with mTHP-zswap wrt mTHP-SSD. The sys time increase can be
attributed to higher swapout activity occurring with zswap-mTHP.
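
For reference, the "Change wrt Baseline" columns can be reproduced from the
table with the short Python sketch below. The helper names and the sign
convention (positive means better: higher throughput or lower sys time) are
my reading of the table rather than something produced by the benchmark
scripts.

```python
def throughput_change_pct(before, after):
    """Percent change in throughput; positive means 'after' is faster."""
    return round((after - before) / before * 100)


def sys_time_change_pct(before, after):
    """Percent change in sys time; positive means 'after' uses less sys time."""
    return round((before - after) / before * 100)


if __name__ == "__main__":
    # Values copied from the 64KB mTHP table (179G SSD swap, sleep 0).
    print("zstd throughput:", throughput_change_pct(93_273, 143_117))
    print("deflate-iaa throughput:", throughput_change_pct(88_496, 134_131))
    print("zstd sys time:", sys_time_change_pct(316.68, 917.88))
    print("deflate-iaa sys time:", sys_time_change_pct(349.00, 877.74))
```

This yields 53 and 52 for throughput, and -190 and -152 for sys time,
matching the last two columns of the table.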

I hope this quantifies the benefit of mTHP-zswap wrt mTHP-SSD in a
non-swap-constrained setup. The 4G SSD swap setup data I shared
in my response to Yosry also indicates better throughput with mTHP-zswap
as compared to mTHP-SSD.

Please do let me know if you have any other questions/suggestions.

Thanks,
Kanchana

> 
> >
> > If we really want to compare CONFIG_THP_SWAP on before and after, it
> > should be with SSD because that's a more conventional setup. In this
> > case the users that have CONFIG_THP_SWAP=y only experience the
> > benefits of zswap with this series. You mentioned experimenting with
> > usemem to keep the memory allocated longer so that you're able to have
> > a fair test with the small SSD swap setup. Did that work?
> >
> > I am hoping Nhat or Johannes would shed some light on whether they
> > usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> > figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> > Otherwise the testing results from case 1 should be sufficient.
> >
> > >
> > > In my opinion, even though the test set up does not provide an accurate
> > > way for a direct before/after comparison (because of zswap usage being
> > > counted in cgroup, hence towards the memory.high), it still seems
> > > reasonable for zswap_store to support (m)THP, so that further
> performance
> > > improvements can be implemented.
> >
> > This is only referring to the results of case 2, right?
> >
> > Honestly, I wouldn't want to merge mTHP swapout support on its own
> > just because it enables further performance improvements without
> > having actual patches for them. But I don't think this captures the
> > results accurately as it dismisses case 1 results (which I think are
> > more reasonable).
> >
> > Thanks


* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-20  2:16     ` Sridhar, Kanchana P
@ 2024-09-20  9:12       ` Huang, Ying
  2024-09-20 16:53         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2024-09-20  9:12 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Nhat Pham, Yosry Ahmed, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, hannes@cmpxchg.org, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

> Hi Nhat,
>
>> -----Original Message-----
>> From: Nhat Pham <nphamcs@gmail.com>
>> Sent: Thursday, August 29, 2024 4:46 PM
>> To: Yosry Ahmed <yosryahmed@google.com>
>> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
>> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
>> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
>> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
>> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
>> Gopal, Vinodh <vinodh.gopal@intel.com>
>> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
>> 
>> On Thu, Aug 29, 2024 at 3:49 PM Yosry Ahmed <yosryahmed@google.com>
>> wrote:
>> >
>> > On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
>> >
>> > We are basically comparing zram with zswap in this case, and it's not
>> > fair because, as you mentioned, the zswap compressed data is being
>> > accounted for while the zram compressed data isn't. I am not really
>> > sure how valuable these test results are. Even if we remove the cgroup
>> > accounting from zswap, we won't see an improvement, we should expect a
>> > similar performance to zram.
>> >
>> > I think the test results that are really valuable are case 1, where
>> > zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
>> > it after this series.
>> 
>> Ah, this is a good point.
>> 
>> I think the point of comparing mTHP zswap v.s mTHP (SSD)swap is more
>> of a sanity check. IOW, if mTHP swap outperforms mTHP zswap, then
>> something is wrong (otherwise why would we enable zswap - might as well
>> just use swap, since SSD swap with mTHP >>> zswap with mTHP >>> zswap
>> without mTHP).
>> 
>> That said, I don't think this benchmark can show it anyway. The access
>> pattern here is such that all the allocated memories are really cold,
>> so swap to disk (or to zram, which does not account memory usage
>> towards cgroup) is better by definition... And Kanchana does not seem
>> to have access to setup with larger SSD swapfiles? :)
>
> As follow up, I created a swapfile on disk to increase the SSD swap to 179G.

Are you sure you used a swapfile instead of a swap partition?  From the
following code in scan_swap_map_slots(),

	if (order > 0) {
		/*
		 * Should not even be attempting large allocations when huge
		 * page swap is disabled.  Warn and fail the allocation.
		 */
		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
		    nr_pages > SWAPFILE_CLUSTER) {
			VM_WARN_ON_ONCE(1);
			return 0;
		}

		/*
		 * Swapfile is not block device or not using clusters so unable
		 * to allocate large entries.
		 */
		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
			return 0;
	}

large folios will be split for a swapfile.
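
For illustration only, the gating logic quoted above can be modelled in
user space roughly as follows. This is a sketch, not kernel code: the
struct, the MODEL_* constants and model_large_alloc_allowed() are
hypothetical stand-ins, and the cluster size of 512 pages is an assumed
value for the example. A "false" result corresponds to the "return 0"
paths, after which the large folio would be split.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel definitions; values are illustrative. */
#define MODEL_SWP_BLKDEV        (1u << 6)
#define MODEL_SWAPFILE_CLUSTER  512u

struct model_swap_info {
	unsigned int flags;      /* models si->flags */
	bool has_cluster_info;   /* models si->cluster_info != NULL */
};

/*
 * Returns true if an allocation of 2^order pages may proceed as one
 * large swap entry, mirroring the two checks in the quoted snippet.
 */
static bool model_large_alloc_allowed(const struct model_swap_info *si,
				      unsigned int order,
				      bool thp_swap_enabled)
{
	unsigned int nr_pages;

	assert(si != NULL);
	assert(order < 32);
	nr_pages = 1u << order;

	/* Order-0 allocations are not affected by these checks. */
	if (order == 0)
		return true;

	/* Large allocations need THP swap and must fit in one cluster. */
	if (!thp_swap_enabled || nr_pages > MODEL_SWAPFILE_CLUSTER)
		return false;

	/* Not a block device, or no cluster info: no large entries. */
	if (!(si->flags & MODEL_SWP_BLKDEV) || !si->has_cluster_info)
		return false;

	return true;
}
```

With this model, a 64K folio (order 4) is refused for a swap area that
lacks the block-device flag and accepted for one that has it.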

--
Best Regards,
Huang, Ying

>  64KB mTHP (cgroup memory.high set to 40G, no swap limit):
>  =========================================================
>  CONFIG_THP_SWAP=Y
>  Sapphire Rapids server with 503 GiB RAM and 179G SSD swap backing device
>  for zswap.
>
>  usemem --init-time -w -O --sleep 0 -n 70 1g:
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
>                                  Baseline                               Baseline
>                                  "before"                 "after"      (sleep 0)
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)    93,273       88,496     143,117     134,131    53%     52%
>  sys time (sec)       316.68       349.00      917.88      877.74  -190%   -152%
>  memcg_high           73,836       83,522     126,120     133,013
>  memcg_swap_fail     261,136      324,533     494,191     578,824
>  pswpin                   16           11           0           0
>  pswpout           1,242,187    1,263,493           0           0
>  zswpin                  694          668         712         702
>  zswpout           3,991,403    4,933,901   9,289,092  10,461,948
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  pgmajfault            3,488        3,353       3,377       3,499
>  ZSWPOUT-64kB            n/a          n/a     110,067     103,957
>  SWPOUT-64kB          77,637       78,968           0           0
>  -------------------------------------------------------------------------------
>
> We do see 50% throughput improvement with mTHP-zswap wrt mTHP-SSD.
> The sys time increase can be attributed to higher swapout activity
> occurring with zswap-mTHP.
>
> I hope this quantifies the benefit of mTHP-zswap wrt mTHP-SSD in a
> non-swap-constrained setup. The 4G SSD swap setup data I shared
> in my response to Yosry also indicates better throughput with mTHP-zswap
> as compared to mTHP-SSD.
>
> Please do let me know if you have any other questions/suggestions.
>
> Thanks,
> Kanchana
>
>> 
>> >
>> > If we really want to compare CONFIG_THP_SWAP on before and after, it
>> > should be with SSD because that's a more conventional setup. In this
>> > case the users that have CONFIG_THP_SWAP=y only experience the
>> > benefits of zswap with this series. You mentioned experimenting with
>> > usemem to keep the memory allocated longer so that you're able to have
>> > a fair test with the small SSD swap setup. Did that work?
>> >
>> > I am hoping Nhat or Johannes would shed some light on whether they
>> > usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
>> > figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
>> > Otherwise the testing results from case 1 should be sufficient.
>> >
>> > >
>> > > In my opinion, even though the test set up does not provide an accurate
>> > > way for a direct before/after comparison (because of zswap usage being
>> > > counted in cgroup, hence towards the memory.high), it still seems
>> > > reasonable for zswap_store to support (m)THP, so that further performance
>> > > improvements can be implemented.
>> >
>> > This is only referring to the results of case 2, right?
>> >
>> > Honestly, I wouldn't want to merge mTHP swapout support on its own
>> > just because it enables further performance improvements without
>> > having actual patches for them. But I don't think this captures the
>> > results accurately as it dismisses case 1 results (which I think are
>> > more reasonable).
>> >
>> > Thanks


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-20  9:12       ` Huang, Ying
@ 2024-09-20 16:53         ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 16:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Nhat Pham, Yosry Ahmed, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, hannes@cmpxchg.org, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, September 20, 2024 2:12 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; Yosry Ahmed
> <yosryahmed@google.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> > Hi Nhat,
> >
> >> -----Original Message-----
> >> From: Nhat Pham <nphamcs@gmail.com>
> >> Sent: Thursday, August 29, 2024 4:46 PM
> >> To: Yosry Ahmed <yosryahmed@google.com>
> >> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> >> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> >> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> >> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> >> Gopal, Vinodh <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> >>
> >> On Thu, Aug 29, 2024 at 3:49 PM Yosry Ahmed <yosryahmed@google.com>
> >> wrote:
> >> >
> >> > On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> >> >
> >> > We are basically comparing zram with zswap in this case, and it's not
> >> > fair because, as you mentioned, the zswap compressed data is being
> >> > accounted for while the zram compressed data isn't. I am not really
> >> > sure how valuable these test results are. Even if we remove the cgroup
> >> > accounting from zswap, we won't see an improvement, we should expect a
> >> > similar performance to zram.
> >> >
> >> > I think the test results that are really valuable are case 1, where
> >> > zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> >> > it after this series.
> >>
> >> Ah, this is a good point.
> >>
> >> I think the point of comparing mTHP zswap v.s mTHP (SSD)swap is more
> >> of a sanity check. IOW, if mTHP swap outperforms mTHP zswap, then
> >> something is wrong (otherwise why would we enable zswap - might as well
> >> just use swap, since SSD swap with mTHP >>> zswap with mTHP >>> zswap
> >> without mTHP).
> >>
> >> That said, I don't think this benchmark can show it anyway. The access
> >> pattern here is such that all the allocated memories are really cold,
> >> so swap to disk (or to zram, which does not account memory usage
> >> towards cgroup) is better by definition... And Kanchana does not seem
> >> to have access to setup with larger SSD swapfiles? :)
> >
> > As follow up, I created a swapfile on disk to increase the SSD swap to 179G.
> 
> Are you sure you used swapfile instead of a swap partition?  From the
> following code in scan_swap_map_slots(),
> 
> 	if (order > 0) {
> 		/*
> 		 * Should not even be attempting large allocations when huge
> 		 * page swap is disabled.  Warn and fail the allocation.
> 		 */
> 		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> 		    nr_pages > SWAPFILE_CLUSTER) {
> 			VM_WARN_ON_ONCE(1);
> 			return 0;
> 		}
> 
> 		/*
> 		 * Swapfile is not block device or not using clusters so unable
> 		 * to allocate large entries.
> 		 */
> 		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> 			return 0;
> 	}
> 
> large folio will be split for swapfile.

I see. Thanks for this clarification. No, this is a configuration with a
175G swapfile on disk + a 4G SSD. Large folios being split for the swapfile
probably explains the memcg_swap_fail counts in this case.

Thanks,
Kanchana

> 
> --
> Best Regards,
> Huang, Ying
> 
> >  64KB mTHP (cgroup memory.high set to 40G, no swap limit):
> >  =========================================================
> >  CONFIG_THP_SWAP=Y
> >  Sapphire Rapids server with 503 GiB RAM and 179G SSD swap backing device
> >  for zswap.
> >
> >  usemem --init-time -w -O --sleep 0 -n 70 1g:
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
> >                                  Baseline                               Baseline
> >                                  "before"                 "after"      (sleep 0)
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    93,273       88,496     143,117     134,131    53%     52%
> >  sys time (sec)       316.68       349.00      917.88      877.74  -190%   -152%
> >  memcg_high           73,836       83,522     126,120     133,013
> >  memcg_swap_fail     261,136      324,533     494,191     578,824
> >  pswpin                   16           11           0           0
> >  pswpout           1,242,187    1,263,493           0           0
> >  zswpin                  694          668         712         702
> >  zswpout           3,991,403    4,933,901   9,289,092  10,461,948
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            3,488        3,353       3,377       3,499
> >  ZSWPOUT-64kB            n/a          n/a     110,067     103,957
> >  SWPOUT-64kB          77,637       78,968           0           0
> >  -------------------------------------------------------------------------------
> >
> > We do see 50% throughput improvement with mTHP-zswap wrt mTHP-SSD.
> > The sys time increase can be attributed to higher swapout activity
> > occurring with zswap-mTHP.
> >
> > I hope this quantifies the benefit of mTHP-zswap wrt mTHP-SSD in a
> > non-swap-constrained setup. The 4G SSD swap setup data I shared
> > in my response to Yosry also indicates better throughput with mTHP-zswap
> > as compared to mTHP-SSD.
> >
> > Please do let me know if you have any other questions/suggestions.
> >
> > Thanks,
> > Kanchana
> >
> >>
> >> >
> >> > If we really want to compare CONFIG_THP_SWAP on before and after, it
> >> > should be with SSD because that's a more conventional setup. In this
> >> > case the users that have CONFIG_THP_SWAP=y only experience the
> >> > benefits of zswap with this series. You mentioned experimenting with
> >> > usemem to keep the memory allocated longer so that you're able to have
> >> > a fair test with the small SSD swap setup. Did that work?
> >> >
> >> > I am hoping Nhat or Johannes would shed some light on whether they
> >> > usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> >> > figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> >> > Otherwise the testing results from case 1 should be sufficient.
> >> >
> >> > >
> >> > > In my opinion, even though the test set up does not provide an accurate
> >> > > way for a direct before/after comparison (because of zswap usage being
> >> > > counted in cgroup, hence towards the memory.high), it still seems
> >> > > reasonable for zswap_store to support (m)THP, so that further performance
> >> > > improvements can be implemented.
> >> >
> >> > This is only referring to the results of case 2, right?
> >> >
> >> > Honestly, I wouldn't want to merge mTHP swapout support on its own
> >> > just because it enables further performance improvements without
> >> > having actual patches for them. But I don't think this captures the
> >> > results accurately as it dismisses case 1 results (which I think are
> >> > more reasonable).
> >> >
> >> > Thanks

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 22:48 ` [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
  2024-08-29 23:45   ` Nhat Pham
@ 2024-08-30  9:27   ` Huang, Ying
  2024-09-20  2:41     ` Sridhar, Kanchana P
  2024-09-20  1:41   ` Sridhar, Kanchana P
  2 siblings, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2024-08-30  9:27 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	nanhai.zou, wajdi.k.feghali, vinodh.gopal

Yosry Ahmed <yosryahmed@google.com> writes:

> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
>>
>> Hi All,
>>
>> This patch-series enables zswap_store() to accept and store mTHP
>> folios. The most significant contribution in this series is from the
>> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
>> migrated to v6.11-rc3 in patch 2/4 of this series.
>>
>> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>>
>> Additionally, there is an attempt to modularize some of the functionality
>> in zswap_store(), to make it more amenable to supporting any-order
>> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
>> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
>> delete all offsets corresponding to a higher order folio stored in zswap.
>>
>> For accounting purposes, the patch-series adds per-order mTHP sysfs
>> "zswpout" counters that get incremented upon successful zswap_store of
>> an mTHP folio:
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>>
>> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
>> will enable/disable zswap storing of (m)THP. When disabled, zswap will
>> fall back to rejecting the mTHP folio, to be processed by the backing
>> swap device.
>>
>> This patch-series is a precursor to ZSWAP compress batching of mTHP
>> swap-out and decompress batching of swap-ins based on swapin_readahead(),
>> using Intel IAA hardware acceleration, which we would like to submit in
>> subsequent RFC patch-series, with performance improvement data.
>>
>> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>>
>> Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
>> reviews and suggestions!
>>
>> Changes since v5:
>> =================
>> 1) Rebased to mm-unstable as of 8/29/2024,
>>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
>> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>>    suggestion to add a knob by which users can enable/disable this
>>    change. Nhat, I hope this is along the lines of what you were
>>    thinking.
>> 3) Added vm-scalability usemem data with 4K folios with
>>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>>    there is no regression with this change.
>> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>>    before/after, as suggested by Yosry, so we can understand the impact
>>    of when mTHPs are split into 4K folios in shrink_folio_list()
>>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>>    in zswap. Thanks Yosry for this suggestion.
>>
>> Changes since v4:
>> =================
>> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>>    Nhat for the data reviews!).
>> 2) Rebased to mm-unstable from 8/27/2024,
>>    commit b659edec079c90012cf8d05624e312d1062b8b87.
>> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>>    robot; as per Nhat's and Michal's suggestion to not require a separate
>>    patch to fix the build errors (thanks both!).
>> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>>    suggested by Yosry (Thanks Yosry!).
>> 5) Squashed the commits that define new mthp zswpout stat counters, and
>>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>>    commit. Thanks Yosry for this suggestion!
>>
>> Changes since v3:
>> =================
>> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>>    changes to count_mthp_stat() so that it's always defined, even when THP
>>    is disabled. Barry, I have also made one other change in page_io.c
>>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>>    appreciate it if you can review this. Thanks!
>>    Hopefully this should resolve the kernel robot build errors.
>>
>> Changes since v2:
>> =================
>> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>>    review the latest data. Thanks!
>> 2) Generated the base commit info in the patches to attempt to address
>>    the kernel test robot build errors.
>> 3) No code changes to the individual patches themselves.
>>
>> Changes since RFC v1:
>> =====================
>>
>> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>>    Thanks Barry!
>> 2) Addressed some of the code review comments that Nhat Pham provided in
>>    Ryan's initial RFC [1]:
>>    - Added a comment about the cgroup zswap limit checks occurring once per
>>      folio at the beginning of zswap_store().
>>      Nhat, Ryan, please do let me know if the comments convey the summary
>>      from the RFC discussion. Thanks!
>>    - Posted data on running the cgroup suite's zswap kselftest.
>> 3) Rebased to v6.11-rc3.
>> 4) Gathered performance data with usemem and the rebased patch-series.
>>
>>
>> Regression Testing:
>> ===================
>> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
>> folios with mm-unstable and with this patch-series. The main goal was
>> to make sure that there is no functional or performance regression
>> wrt the earlier zswap behavior for 4K folios when
>> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
>> pages goes through the newly added code path [zswap_store(),
>> zswap_store_page()].
>>
>> The data indicates there is no regression.
>>
>>  ------------------------------------------------------------------------------
>>                      mm-unstable 8-28-2024                        zswap-mTHP v6
>>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>>                                                                      is not set
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
>>                                        iaa                                  iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)    110,775      113,010               111,550        121,937
>>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
>>  memcg_high           140,500      153,737               139,772        134,129
>>  memcg_swap_high            0            0                     0              0
>>  memcg_swap_fail            0            0                     0              0
>>  pswpin                     0            0                     0              0
>>  pswpout                    0            0                     0              0
>>  zswpin                   675          690                   682            684
>>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
>>  thp_swpout                 0            0                     0              0
>>  thp_swpout_                0            0                     0              0
>>   fallback
>>  pgmajfault             3,453        3,468                 3,841          3,487
>>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
>>  SWPOUT-64kB-mTHP           0            0                     0              0
>>  ------------------------------------------------------------------------------
>>
>>
>> Performance Testing:
>> ====================
>> Testing of this patch-series was done with the v6.11-rc3 mainline, without
>> and with this patch-series, on an Intel Sapphire Rapids server,
>> dual-socket 56 cores per socket, 4 IAA devices per socket.
>>
>> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
>> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
>> Core frequency was fixed at 2500MHz.
>>
>> The vm-scalability "usemem" test was run in a cgroup whose memory.high
>> was fixed at 40G. There is no swap limit set for the cgroup. Following a
>> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
>> series [2], 70 usemem processes were run, each allocating and writing 1G of
>> memory:
>>
>>     usemem --init-time -w -O -n 70 1g
>>
>> The vm/sysfs mTHP stats included with the performance data provide details
>> on the swapout activity to ZSWAP/swap.
>>
>> Other kernel configuration parameters:
>>
>>     ZSWAP Compressors : zstd, deflate-iaa
>>     ZSWAP Allocator   : zsmalloc
>>     SWAP page-cluster : 2
>>
>> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
>> IAA "compression verification" is enabled. Hence each IAA compression
>> will be decompressed internally by the "iaa_crypto" driver, the crc-s
>> returned by the hardware will be compared and errors reported in case of
>> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
>> compared to the software compressors.
>>
>> Throughput is derived by averaging the individual 70 processes' throughputs
>> reported by usemem. sys time is measured with perf. All data points are
>> averaged across 3 runs.
>>
>> Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
>> ===============================================================================
>>
>> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results in
>> 64K/2M (m)THP being split, and only 4K folios being processed by zswap.
>>
>> The "after" is CONFIG_THP_SWAP set to on, with this patch-series, which results
>> in 64K/2M (m)THP not being split, and being processed by zswap.
>>
>>  64KB mTHP (cgroup memory.high set to 40G):
>>  ==========================================
>>
>>  -------------------------------------------------------------------------------
>>                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
>>                                  Baseline                               Baseline
>>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>>  -------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>>                                       iaa                     iaa            iaa
>>  -------------------------------------------------------------------------------
>>  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
>>  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
>>  memcg_high          124,183      127,513     138,651     133,884
>>  memcg_swap_high           0            0           0           0
>>  memcg_swap_fail     619,020      751,099           0           0
>>  pswpin                    0            0           0           0
>>  pswpout                   0            0           0           0
>>  zswpin                  656          569         624         639
>>  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
>>  thp_swpout                0            0           0           0
>>  thp_swpout_               0            0           0           0
>>   fallback
>>  pgmajfault            3,470        3,382       4,633       3,611
>>  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
>>  SWPOUT-64kB               0            0           0           0
>>  -------------------------------------------------------------------------------
>>
>>
>>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>>  =======================================================
>>
>>  ------------------------------------------------------------------------------
>>                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
>>                                  Baseline                              Baseline
>>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>>                                      iaa                     iaa            iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
>>  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
>>  memcg_high            14,628     16,247       14,951      16,096
>>  memcg_swap_high            0          0            0           0
>>  memcg_swap_fail       18,698     21,114            0           0
>>  pswpin                     0          0            0           0
>>  pswpout                    0          0            0           0
>>  zswpin                   663        665        5,333         781
>>  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
>>  thp_swpout                 0          0            0           0
>>  thp_swpout_           18,697     21,113            0           0
>>   fallback
>>  pgmajfault             3,439      3,496        8,139       3,582
>>  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
>>  SWPOUT-2048kB              0          0            0           0
>>  ------------------------------------------------------------------------------
>>
>> We see improvements overall in throughput and sys time for zstd and
>> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
>>
>>
>> Case 2: Baseline with CONFIG_THP_SWAP enabled.
>> ==============================================
>>
>> In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
>> being stored by the backing swap device.
>>
>> The "after" represents data with this patch-series, that results in 64K/2M
>> (m)THP being processed by zswap.
>>
>>  64KB mTHP (cgroup memory.high set to 40G):
>>  ==========================================
>>
>>  ------------------------------------------------------------------------------
>>                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
>>                                Baseline                                Baseline
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
>>                                     iaa                     iaa             iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
>>  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
>>  memcg_high          111,223    110,889     138,651     133,884
>>  memcg_swap_high           0          0           0           0
>>  memcg_swap_fail           0          0           0           0
>>  pswpin                   16         16           0           0
>>  pswpout           7,471,472  7,527,963           0           0
>>  zswpin                  635        605         624         639
>>  zswpout               1,509      1,478   9,453,761   9,385,910
>>  thp_swpout                0          0           0           0
>>  thp_swpout_               0          0           0           0
>>   fallback
>>  pgmajfault            3,616      3,430       4,633       3,611
>>  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
>>  SWPOUT-64kB         466,967    470,498           0           0
>>  ------------------------------------------------------------------------------
>>
>>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>>  =======================================================
>>
>>  ------------------------------------------------------------------------------
>>                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
>>                                 Baseline                               Baseline
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>>                                      iaa                     iaa            iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
>>  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
>>  memcg_high            16,054     15,936      14,951      16,096
>>  memcg_swap_high            0          0           0           0
>>  memcg_swap_fail            0          0           0           0
>>  pswpin                     0          0           0           0
>>  pswpout            8,629,248  8,628,907           0           0
>>  zswpin                   560        645       5,333         781
>>  zswpout                1,416      1,503   8,546,895   9,355,760
>>  thp_swpout            16,854     16,853           0           0
>>  thp_swpout_                0          0           0           0
>>   fallback
>>  pgmajfault             3,341      3,574       8,139       3,582
>>  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
>>  SWPOUT-2048kB         16,854     16,853           0           0
>>  ------------------------------------------------------------------------------
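
The per-order counters in the two tables above can be cross-checked against the global zswpout count. The following is a minimal sketch, assuming 4 KiB base pages, that zswpout counts base pages, and that ZSWPOUT-64kB and ZSWPOUT-2048kB count whole folios. These assumptions are inferred from the numbers, not stated in the series, and the helper name is hypothetical.

```python
BASE_PAGE_KB = 4  # assumption: 4 KiB base pages


def pages_from_mthp(folio_count, folio_kb):
    """Return the number of base pages covered by folio_count folios of folio_kb KB."""
    return folio_count * (folio_kb // BASE_PAGE_KB)


# (label, zswpout, per-order ZSWPOUT counter, folio size in KB), all taken
# from the zswap-mTHP columns of the two tables above.
ROWS = [
    ("64kB   zstd       ", 9_453_761, 590_768, 64),
    ("64kB   deflate-iaa", 9_385_910, 586_521, 64),
    ("2048kB zstd       ", 8_546_895, 16_684, 2048),
    ("2048kB deflate-iaa", 9_355_760, 18_270, 2048),
]

for label, zswpout, mthp_count, folio_kb in ROWS:
    implied = pages_from_mthp(mthp_count, folio_kb)
    residual = zswpout - implied  # pages not explained by this order's stores
    print(f"{label}  implied={implied:>9,}  residual={residual:>5,}")
```

Under these assumptions the mTHP stores account for nearly all of zswpout in each zswap-mTHP column; the residual stays under five thousand pages and would correspond to stores of other folio sizes.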
>>
>> In the "Before" scenario, when zswap does not store mTHP, only allocations
>> count towards the cgroup memory limit. However, in the "After" scenario,
>> with zswap_store() handling mTHP, both the allocations and the zswap
>> compressed pool usage from all 70 processes are counted towards the
>> memory limit. As a result, we see higher swapout activity in the "After"
>> data. Hence, more time is spent doing reclaim as the zswap cgroup charge
>> leads to more frequent memory.high breaches.
>>
>> This causes degradation in throughput and sys time with zswap mTHP, more so
>> in case of zstd than deflate-iaa. Compress latency could play a part in
>> this - when there is more swapout activity happening, a slower compressor
>> would cause allocations to stall for any/all of the 70 processes.
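
For reference, the "Change wrt Baseline" columns can be reproduced from the raw values. The sketch below assumes a sign convention inferred from the tables, where a positive percentage means an improvement (higher throughput, lower sys time). The function name is hypothetical and the inputs are the Case 2, 64KB mTHP rows quoted above.

```python
def change_pct(before, after, higher_is_better):
    """Percent change relative to the baseline; positive means improvement."""
    delta = (after - before) if higher_is_better else (before - after)
    return 100.0 * delta / before


# Case 2, 64KB mTHP: (baseline, zswap-mTHP) pairs from the table above.
THROUGHPUT_KBPS = {"zstd": (161_496, 140_363), "deflate-iaa": (156_343, 151_938)}
SYS_TIME_SEC = {"zstd": (771.68, 954.85), "deflate-iaa": (802.08, 735.47)}

for comp in ("zstd", "deflate-iaa"):
    tput = change_pct(*THROUGHPUT_KBPS[comp], higher_is_better=True)
    sys_t = change_pct(*SYS_TIME_SEC[comp], higher_is_better=False)
    print(f"{comp:12s} throughput {tput:+.0f}%  sys time {sys_t:+.0f}%")
```

With this convention, a negative sys time entry means more time was spent in the kernel than in the baseline.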
>
> We are basically comparing zram with zswap in this case, and it's not
> fair because, as you mentioned, the zswap compressed data is being
> accounted for while the zram compressed data isn't. I am not really
> sure how valuable these test results are. Even if we remove the cgroup
> accounting from zswap, we won't see an improvement, we should expect a
> similar performance to zram.
>
> I think the test results that are really valuable are case 1, where
> zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> it after this series.
>
> If we really want to compare CONFIG_THP_SWAP on before and after, it
> should be with SSD because that's a more conventional setup. In this
> case the users that have CONFIG_THP_SWAP=y only experience the
> benefits of zswap with this series.

Yes.  I think so too.

> You mentioned experimenting with
> usemem to keep the memory allocated longer so that you're able to have
> a fair test with the small SSD swap setup. Did that work?

Looking forward to the results of this test too.

> I am hoping Nhat or Johannes would shed some light on whether they
> usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> Otherwise the testing results from case 1 should be sufficient.

I guess that even if 2MB THP swapping may not be popular, 64KB mTHP
swapping to SSD or zswap looks much more appealing.

>>
>> In my opinion, even though the test setup does not provide an accurate
>> way for a direct before/after comparison (because zswap usage is counted
>> in the cgroup, and hence towards memory.high), it still seems reasonable
>> for zswap_store to support (m)THP, so that further performance
>> improvements can be implemented.
>
> This is only referring to the results of case 2, right?
>
> Honestly, I wouldn't want to merge mTHP swapout support on its own
> just because it enables further performance improvements without
> having actual patches for them. But I don't think this captures the
> results accurately as it dismisses case 1 results (which I think are
> more reasonable).

--
Best Regards,
Huang, Ying



* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-30  9:27   ` Huang, Ying
@ 2024-09-20  2:41     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  2:41 UTC (permalink / raw)
  To: Huang, Ying, Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, August 30, 2024 2:28 AM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> Yosry Ahmed <yosryahmed@google.com> writes:
> 
> > On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> >>
> >> Hi All,
> >>
> >> This patch-series enables zswap_store() to accept and store mTHP
> >> folios. The most significant contribution in this series is from the
> >> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> >> migrated to v6.11-rc3 in patch 2/3 of this series.
> >>
> >> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >>
> >> Additionally, there is an attempt to modularize some of the functionality
> >> in zswap_store(), to make it more amenable to supporting any-order
> >> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> >> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> >> delete all offsets corresponding to a higher order folio stored in zswap.
> >>
> >> For accounting purposes, the patch-series adds per-order mTHP sysfs
> >> "zswpout" counters that get incremented upon successful zswap_store of
> >> an mTHP folio:
> >>
> >> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
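
The per-order counters at the sysfs path above can be collected with a short script. This is a minimal sketch, assuming each zswpout file holds a single integer and that the kernel in use exposes these files; the function name and the base parameter are hypothetical and only there to make the sketch testable.

```python
import glob
import os
import re

THP_SYSFS = "/sys/kernel/mm/transparent_hugepage"


def read_zswpout(base=THP_SYSFS):
    """Return {folio size in kB: zswpout count} for every hugepages-*kB directory."""
    counters = {}
    pattern = os.path.join(base, "hugepages-*kB", "stats", "zswpout")
    for path in sorted(glob.glob(pattern)):
        # .../hugepages-64kB/stats/zswpout -> "hugepages-64kB"
        size_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"hugepages-(\d+)kB", size_dir)
        if match is None:
            continue
        with open(path) as f:
            # assumption: the file contains one integer
            counters[int(match.group(1))] = int(f.read().strip())
    return counters


# Prints nothing on kernels that do not provide the counters.
for size_kb, count in sorted(read_zswpout().items()):
    print(f"hugepages-{size_kb}kB zswpout: {count}")
```

Sampling the counters before and after a run and subtracting gives the per-order store counts for that run.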
> >>
> >> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> >> will enable/disable zswap storing of (m)THP. When disabled, zswap will
> >> fall back to rejecting the mTHP folio, to be processed by the backing
> >> swap device.
> >>
> >> This patch-series is a precursor to ZSWAP compress batching of mTHP
> >> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> >> using Intel IAA hardware acceleration, which we would like to submit in
> >> subsequent RFC patch-series, with performance improvement data.
> >>
> >> Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >>
> >> Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
> >> reviews and suggestions!
> >>
> >> Changes since v5:
> >> =================
> >> 1) Rebased to mm-unstable as of 8/29/2024,
> >>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> >> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >>    suggestion to add a knob by which users can enable/disable this
> >>    change. Nhat, I hope this is along the lines of what you were
> >>    thinking.
> >> 3) Added vm-scalability usemem data with 4K folios with
> >>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >>    there is no regression with this change.
> >> 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >>    before/after, as suggested by Yosry, so we can understand the impact
> >>    of when mTHPs are split into 4K folios in shrink_folio_list()
> >>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >>    in zswap. Thanks Yosry for this suggestion.
> >>
> >> Changes since v4:
> >> =================
> >> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >>    Nhat for the data reviews!).
> >> 2) Rebased to mm-unstable from 8/27/2024,
> >>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> >> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >>    robot; as per Nhat's and Michal's suggestion to not require a separate
> >>    patch to fix the build errors (thanks both!).
> >> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >>    suggested by Yosry (Thanks Yosry!).
> >> 5) Squashed the commits that define new mthp zswpout stat counters, and
> >>    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >>    commit. Thanks Yosry for this suggestion!
> >>
> >> Changes since v3:
> >> =================
> >> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >>    changes to count_mthp_stat() so that it's always defined, even when THP
> >>    is disabled. Barry, I have also made one other change in page_io.c
> >>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >>    appreciate it if you can review this. Thanks!
> >>    Hopefully this should resolve the kernel robot build errors.
> >>
> >> Changes since v2:
> >> =================
> >> 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >>    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >>    review the latest data. Thanks!
> >> 2) Generated the base commit info in the patches to attempt to address
> >>    the kernel test robot build errors.
> >> 3) No code changes to the individual patches themselves.
> >>
> >> Changes since RFC v1:
> >> =====================
> >>
> >> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >>    Thanks Barry!
> >> 2) Addressed some of the code review comments that Nhat Pham provided in
> >>    Ryan's initial RFC [1]:
> >>    - Added a comment about the cgroup zswap limit checks occurring once per
> >>      folio at the beginning of zswap_store().
> >>      Nhat, Ryan, please do let me know if the comments convey the summary
> >>      from the RFC discussion. Thanks!
> >>    - Posted data on running the cgroup suite's zswap kselftest.
> >> 3) Rebased to v6.11-rc3.
> >> 4) Gathered performance data with usemem and the rebased patch-series.
> >>
> >>
> >> Regression Testing:
> >> ===================
> >> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> >> folios with mm-unstable and with this patch-series. The main goal was
> >> to make sure that there is no functional or performance regression
> >> wrt the earlier zswap behavior for 4K folios when
> >> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> >> pages goes through the newly added code path [zswap_store(),
> >> zswap_store_page()].
> >>
> >> The data indicates there is no regression.
> >>
> >>  ------------------------------------------------------------------------------
> >>                      mm-unstable 8-28-2024                        zswap-mTHP v6
> >>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >>                                                                      is not set
> >>  ------------------------------------------------------------------------------
> >>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
> >>                                        iaa                                  iaa
> >>  ------------------------------------------------------------------------------
> >>  Throughput (KB/s)    110,775      113,010               111,550        121,937
> >>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
> >>  memcg_high           140,500      153,737               139,772        134,129
> >>  memcg_swap_high            0            0                     0              0
> >>  memcg_swap_fail            0            0                     0              0
> >>  pswpin                     0            0                     0              0
> >>  pswpout                    0            0                     0              0
> >>  zswpin                   675          690                   682            684
> >>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
> >>  thp_swpout                 0            0                     0              0
> >>  thp_swpout_                0            0                     0              0
> >>   fallback
> >>  pgmajfault             3,453        3,468                 3,841          3,487
> >>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
> >>  SWPOUT-64kB-mTHP           0            0                     0              0
> >>  ------------------------------------------------------------------------------
> >>
> >>
> >> Performance Testing:
> >> ====================
> >> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> >> and with this patch-series, on an Intel Sapphire Rapids server,
> >> dual-socket 56 cores per socket, 4 IAA devices per socket.
> >>
> >> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> >> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> >> Core frequency was fixed at 2500MHz.
> >>
> >> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> >> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> >> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> >> series [2], 70 usemem processes were run, each allocating and writing 1G of
> >> memory:
> >>
> >>     usemem --init-time -w -O -n 70 1g
> >>
> >> The vm/sysfs mTHP stats included with the performance data provide details
> >> on the swapout activity to ZSWAP/swap.
> >>
> >> Other kernel configuration parameters:
> >>
> >>     ZSWAP Compressors : zstd, deflate-iaa
> >>     ZSWAP Allocator   : zsmalloc
> >>     SWAP page-cluster : 2
> >>
> >> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> >> IAA "compression verification" is enabled. Hence each IAA compression
> >> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> >> returned by the hardware will be compared and errors reported in case of
> >> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> >> compared to the software compressors.
> >>
> >> Throughput is derived by averaging the individual 70 processes' throughputs
> >> reported by usemem. sys time is measured with perf. All data points are
> >> averaged across 3 runs.
> >>
> >> Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
> >> ===============================================================================
> >>
> >> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> >> in 64K/2M (m)THP being split, and only 4K folios processed by zswap.
> >>
> >> The "after" is CONFIG_THP_SWAP set to on with this patch-series, which
> >> results in 64K/2M (m)THP not being split, and being processed by zswap.
> >>
> >>  64KB mTHP (cgroup memory.high set to 40G):
> >>  ==========================================
> >>
> >>  -------------------------------------------------------------------------------
> >>                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >>                                  Baseline                               Baseline
> >>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >>  -------------------------------------------------------------------------------
> >>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >>                                       iaa                     iaa            iaa
> >>  -------------------------------------------------------------------------------
> >>  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
> >>  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
> >>  memcg_high          124,183      127,513     138,651     133,884
> >>  memcg_swap_high           0            0           0           0
> >>  memcg_swap_fail     619,020      751,099           0           0
> >>  pswpin                    0            0           0           0
> >>  pswpout                   0            0           0           0
> >>  zswpin                  656          569         624         639
> >>  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
> >>  thp_swpout                0            0           0           0
> >>  thp_swpout_               0            0           0           0
> >>   fallback
> >>  pgmajfault            3,470        3,382       4,633       3,611
> >>  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
> >>  SWPOUT-64kB               0            0           0           0
> >>  -------------------------------------------------------------------------------
> >>
> >>
> >>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >>  =======================================================
> >>
> >>  ------------------------------------------------------------------------------
> >>                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
> >>                                  Baseline                              Baseline
> >>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >>  ------------------------------------------------------------------------------
> >>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >>                                      iaa                     iaa            iaa
> >>  ------------------------------------------------------------------------------
> >>  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
> >>  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
> >>  memcg_high            14,628     16,247       14,951      16,096
> >>  memcg_swap_high            0          0            0           0
> >>  memcg_swap_fail       18,698     21,114            0           0
> >>  pswpin                     0          0            0           0
> >>  pswpout                    0          0            0           0
> >>  zswpin                   663        665        5,333         781
> >>  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
> >>  thp_swpout                 0          0            0           0
> >>  thp_swpout_           18,697     21,113            0           0
> >>   fallback
> >>  pgmajfault             3,439      3,496        8,139       3,582
> >>  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
> >>  SWPOUT-2048kB              0          0            0           0
> >>  -----------------------------------------------------------------------------
> >>
> >> We see improvements overall in throughput and sys time for zstd and
> >> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> >>
> >>
> >> Case 2: Baseline with CONFIG_THP_SWAP enabled.
> >> ==============================================
> >>
> >> In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
> >> being stored by the backing swap device.
> >>
> >> The "after" represents data with this patch-series, that results in 64K/2M
> >> (m)THP being processed by zswap.
> >>
> >>  64KB mTHP (cgroup memory.high set to 40G):
> >>  ==========================================
> >>
> >>  ------------------------------------------------------------------------------
> >>                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
> >>                                Baseline                                Baseline
> >>  ------------------------------------------------------------------------------
> >>  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
> >>                                     iaa                     iaa             iaa
> >>  ------------------------------------------------------------------------------
> >>  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
> >>  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
> >>  memcg_high          111,223    110,889     138,651     133,884
> >>  memcg_swap_high           0          0           0           0
> >>  memcg_swap_fail           0          0           0           0
> >>  pswpin                   16         16           0           0
> >>  pswpout           7,471,472  7,527,963           0           0
> >>  zswpin                  635        605         624         639
> >>  zswpout               1,509      1,478   9,453,761   9,385,910
> >>  thp_swpout                0          0           0           0
> >>  thp_swpout_               0          0           0           0
> >>   fallback
> >>  pgmajfault            3,616      3,430       4,633       3,611
> >>  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
> >>  SWPOUT-64kB         466,967    470,498           0           0
> >>  ------------------------------------------------------------------------------
> >>
> >>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >>  =======================================================
> >>
> >>  ------------------------------------------------------------------------------
> >>                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >>                                 Baseline                               Baseline
> >>  ------------------------------------------------------------------------------
> >>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >>                                      iaa                     iaa            iaa
> >>  ------------------------------------------------------------------------------
> >>  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
> >>  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
> >>  memcg_high            16,054     15,936      14,951      16,096
> >>  memcg_swap_high            0          0           0           0
> >>  memcg_swap_fail            0          0           0           0
> >>  pswpin                     0          0           0           0
> >>  pswpout            8,629,248  8,628,907           0           0
> >>  zswpin                   560        645       5,333         781
> >>  zswpout                1,416      1,503   8,546,895   9,355,760
> >>  thp_swpout            16,854     16,853           0           0
> >>  thp_swpout_                0          0           0           0
> >>   fallback
> >>  pgmajfault             3,341      3,574       8,139       3,582
> >>  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
> >>  SWPOUT-2048kB         16,854     16,853           0           0
> >>  ------------------------------------------------------------------------------
> >>
> >> In the "Before" scenario, when zswap does not store mTHP, only allocations
> >> count towards the cgroup memory limit. However, in the "After" scenario,
> >> with zswap_store() handling mTHP, both the allocations and the zswap
> >> compressed pool usage from all 70 processes are counted towards the
> >> memory limit. As a result, we see higher swapout activity in the "After"
> >> data. Hence, more time is spent doing reclaim as the zswap cgroup charge
> >> leads to more frequent memory.high breaches.
> >>
> >> This causes degradation in throughput and sys time with zswap mTHP, more so
> >> in case of zstd than deflate-iaa. Compress latency could play a part in
> >> this - when there is more swapout activity happening, a slower compressor
> >> would cause allocations to stall for any/all of the 70 processes.
> >
> > We are basically comparing zram with zswap in this case, and it's not
> > fair because, as you mentioned, the zswap compressed data is being
> > accounted for while the zram compressed data isn't. I am not really
> > sure how valuable these test results are. Even if we remove the cgroup
> > accounting from zswap, we won't see an improvement, we should expect a
> > similar performance to zram.
> >
> > I think the test results that are really valuable are case 1, where
> > zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> > it after this series.
> >
> > If we really want to compare CONFIG_THP_SWAP on before and after, it
> > should be with SSD because that's a more conventional setup. In this
> > case the users that have CONFIG_THP_SWAP=y only experience the
> > benefits of zswap with this series.
> 
> Yes.  I think so too.
> 
> > You mentioned experimenting with
> > usemem to keep the memory allocated longer so that you're able to have
> > a fair test with the small SSD swap setup. Did that work?
> 
> Looking forward to the results of this test too.

I just posted the data from this test in the 4G SSD setup, in response
to Yosry's comments. Please do review the data and let me know if
you have any questions/suggestions.

Thanks,
Kanchana

> 
> > I am hoping Nhat or Johannes would shed some light on whether they
> > usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> > figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> > Otherwise the testing results from case 1 should be sufficient.
> 
> I guess that even if 2MB THP swapping may not be popular, 64KB mTHP
> swapping to SSD or zswap looks much more appealing.

The data I posted today is for 64k mTHP. We see better usemem throughput
with zswap-mTHP as compared to SSD-mTHP.

Thanks,
Kanchana

> 
> >>
> >> In my opinion, even though the test setup does not provide an accurate
> >> way for a direct before/after comparison (because zswap usage is counted
> >> in the cgroup, and hence towards memory.high), it still seems reasonable
> >> for zswap_store to support (m)THP, so that further performance
> >> improvements can be implemented.
> >
> > This is only referring to the results of case 2, right?
> >
> > Honestly, I wouldn't want to merge mTHP swapout support on its own
> > just because it enables further performance improvements without
> > having actual patches for them. But I don't think this captures the
> > results accurately as it dismisses case 1 results (which I think are
> > more reasonable).
> 
> --
> Best Regards,
> Huang, Ying


* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 22:48 ` [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
  2024-08-29 23:45   ` Nhat Pham
  2024-08-30  9:27   ` Huang, Ying
@ 2024-09-20  1:41   ` Sridhar, Kanchana P
  2024-09-20  9:29     ` Huang, Ying
  2024-09-20 23:15     ` Yosry Ahmed
  2 siblings, 2 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20  1:41 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, August 29, 2024 3:49 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/3 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. When disabled, zswap will
> > fall back to rejecting the mTHP folio, to be processed by the backing
> > swap device.
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
> > reviews and suggestions!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once
> >      per folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> > folios with mm-unstable and with this patch-series. The main goal was
> > to make sure that there is no functional or performance regression
> > wrt the earlier zswap behavior for 4K folios, when
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> > pages goes through the newly added code path [zswap_store(),
> > zswap_store_page()].
> >
> > The data indicates there is no regression.
> >
> >  ------------------------------------------------------------------------------
> >                      mm-unstable 8-28-2024                        zswap-mTHP v6
> >                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                                      is not set
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
> >                                        iaa                                  iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    110,775      113,010               111,550        121,937
> >  sys time (sec)      1,141.72       954.87              1,131.95         828.47
> >  memcg_high           140,500      153,737               139,772        134,129
> >  memcg_swap_high            0            0                     0              0
> >  memcg_swap_fail            0            0                     0              0
> >  pswpin                     0            0                     0              0
> >  pswpout                    0            0                     0              0
> >  zswpin                   675          690                   682            684
> >  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
> >  thp_swpout                 0            0                     0              0
> >  thp_swpout_                0            0                     0              0
> >   fallback
> >  pgmajfault             3,453        3,468                 3,841          3,487
> >  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
> >  SWPOUT-64kB-mTHP           0            0                     0              0
> >  ------------------------------------------------------------------------------
> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> > backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> > Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> > Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
> > ===============================================================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> > in 64K/2M (m)THP being split, and only 4K folios being processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, with this patch-series, which
> > results in 64K/2M (m)THP not being split, and being processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                  Baseline                               Baseline
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
> >  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
> >  memcg_high          124,183      127,513     138,651     133,884
> >  memcg_swap_high           0            0           0           0
> >  memcg_swap_fail     619,020      751,099           0           0
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  656          569         624         639
> >  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            3,470        3,382       4,633       3,611
> >  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
> >                                  Baseline                              Baseline
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
> >  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
> >  memcg_high            14,628     16,247       14,951      16,096
> >  memcg_swap_high            0          0            0           0
> >  memcg_swap_fail       18,698     21,114            0           0
> >  pswpin                     0          0            0           0
> >  pswpout                    0          0            0           0
> >  zswpin                   663        665        5,333         781
> >  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
> >  thp_swpout                 0          0            0           0
> >  thp_swpout_           18,697     21,113            0           0
> >   fallback
> >  pgmajfault             3,439      3,496        8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
> >  SWPOUT-2048kB              0          0            0           0
> >  -----------------------------------------------------------------------------
> >
> > We see improvements overall in throughput and sys time for zstd and
> > deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> >
> >
> > Case 2: Baseline with CONFIG_THP_SWAP enabled.
> > ==============================================
> >
> > In this scenario, the "before" represents zswap rejecting mTHP, and the
> > mTHP being stored by the backing swap device.
> >
> > The "after" represents data with this patch-series, that results in 64K/2M
> > (m)THP being processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  ------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
> >                                Baseline                                Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
> >                                     iaa                     iaa             iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
> >  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
> >  memcg_high          111,223    110,889     138,651     133,884
> >  memcg_swap_high           0          0           0           0
> >  memcg_swap_fail           0          0           0           0
> >  pswpin                   16         16           0           0
> >  pswpout           7,471,472  7,527,963           0           0
> >  zswpin                  635        605         624         639
> >  zswpout               1,509      1,478   9,453,761   9,385,910
> >  thp_swpout                0          0           0           0
> >  thp_swpout_               0          0           0           0
> >   fallback
> >  pgmajfault            3,616      3,430       4,633       3,611
> >  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
> >  SWPOUT-64kB         466,967    470,498           0           0
> >  ------------------------------------------------------------------------------
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                 Baseline                               Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
> >  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
> >  memcg_high            16,054     15,936      14,951      16,096
> >  memcg_swap_high            0          0           0           0
> >  memcg_swap_fail            0          0           0           0
> >  pswpin                     0          0           0           0
> >  pswpout            8,629,248  8,628,907           0           0
> >  zswpin                   560        645       5,333         781
> >  zswpout                1,416      1,503   8,546,895   9,355,760
> >  thp_swpout            16,854     16,853           0           0
> >  thp_swpout_                0          0           0           0
> >   fallback
> >  pgmajfault             3,341      3,574       8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
> >  SWPOUT-2048kB         16,854     16,853           0           0
> >  ------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both allocations and
> > the zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP, more so
> > in case of zstd than deflate-iaa. Compress latency could play a part in
> > this - when there is more swapout activity happening, a slower compressor
> > would cause allocations to stall for any/all of the 70 processes.
> 
> We are basically comparing zram with zswap in this case, and it's not
> fair because, as you mentioned, the zswap compressed data is being
> accounted for while the zram compressed data isn't. I am not really
> sure how valuable these test results are. Even if we remove the cgroup
> accounting from zswap, we won't see an improvement, we should expect a
> similar performance to zram.
> 
> I think the test results that are really valuable are case 1, where
> zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> it after this series.
> 
> If we really want to compare CONFIG_THP_SWAP on before and after, it
> should be with SSD because that's a more conventional setup. In this
> case the users that have CONFIG_THP_SWAP=y only experience the
> benefits of zswap with this series. You mentioned experimenting with
> usemem to keep the memory allocated longer so that you're able to have
> a fair test with the small SSD swap setup. Did that work?

Thanks, these are good points. I ran this experiment with mm-unstable 9-17-2024,
commit 248ba8004e76eb335d7e6079724c3ee89a011389.

Data is based on the average of 3 runs of the vm-scalability "usemem" test.

 4G SSD backing zswap, each process sleeps before exiting
 ========================================================

 64KB mTHP (cgroup memory.high set to 60G, no swap limit):
 =========================================================
 CONFIG_THP_SWAP=Y
 Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
 for zswap.

 Experiment 1: Each process sleeps for 0 sec after allocating memory
 (usemem --init-time -w -O --sleep 0 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"      (sleep 0)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
 sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
 memcg_high            3,503        3,769      44,425      27,154
 memcg_swap_fail           0            0     115,814     141,936
 pswpin                   17            0           0           0
 pswpout             370,853      393,232           0           0
 zswpin                  693          123         666         667
 zswpout               1,484          123   1,366,680   1,199,645
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,384        2,951       3,656       3,468
 ZSWPOUT-64kB            n/a          n/a      82,940      73,121
 SWPOUT-64kB          23,178       24,577           0           0
 -------------------------------------------------------------------------------


 Experiment 2: Each process sleeps for 10 sec after allocating memory
 (usemem --init-time -w -O --sleep 10 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"     (sleep 10)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
 sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%
 memcg_high          169,450      188,700     143,691     177,887
 memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
 pswpin                   17           16           0           0
 pswpout           1,154,779    1,210,485           0           0
 zswpin                  711          659       1,016         736
 zswpout              70,212       50,128   1,235,560   1,275,917
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            6,120        6,291       8,789       6,474
 ZSWPOUT-64kB            n/a          n/a      67,587      68,912
 SWPOUT-64kB          72,174       75,655           0           0
 -------------------------------------------------------------------------------
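
 For reference, the "Change wrt Baseline" columns in the two tables above
 can be reproduced with the short Python sketch below. It assumes the sign
 convention that a positive value is an improvement (higher throughput,
 lower sys time), which matches the reported figures. The helper names
 pct_change() and change_table() are illustrative only.

```python
def pct_change(before, after, higher_is_better=True):
    """Percent change of `after` relative to `before`; positive means better."""
    delta = (after - before) if higher_is_better else (before - after)
    return round(100.0 * delta / before)


# (before, after) pairs copied from the two experiment tables.
experiments = {
    "sleep 0": {
        "throughput": {"zstd": (296_684, 359_722),
                       "deflate-iaa": (274_207, 390_162)},
        "sys_time": {"zstd": (92.67, 251.06),
                     "deflate-iaa": (93.33, 237.56)},
    },
    "sleep 10": {
        "throughput": {"zstd": (86_744, 157_528),
                       "deflate-iaa": (93_730, 113_110)},
        "sys_time": {"zstd": (308.87, 477.55),
                     "deflate-iaa": (315.29, 629.98)},
    },
}


def change_table(data):
    """Return {(experiment, metric, compressor): percent change}."""
    out = {}
    for name, metrics in data.items():
        for metric, per_compressor in metrics.items():
            for compressor, (before, after) in per_compressor.items():
                out[(name, metric, compressor)] = pct_change(
                    before, after, higher_is_better=(metric == "throughput"))
    return out


changes = change_table(experiments)
for key, value in changes.items():
    print(key, f"{value}%")
```

 Under this convention, the negative sys time entries mean that sys time
 went up with zswap-mTHP, not that it was reduced.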


Conclusions from the experiments:
=================================
1) zswap-mTHP improves throughput as compared to the baseline, for zstd and
   deflate-iaa.

2) Yosry's theory holds in the 4G constrained swap setup. When each
   process sleeps for 10 sec after allocating memory, thereby keeping the
   memory allocated longer, the "Baseline" or "before" with mTHP getting
   stored in SSD shows a degradation of up to 71% in throughput (zstd)
   and up to 238% in sys time (deflate-iaa), as compared to the
   "Baseline" with sleep 0, which benefits from serialization of disk IO
   not allowing all processes to allocate memory at the same time.

3) In the 4G SSD "sleep 0" case, zswap-mTHP shows an increase in sys time
   due to the cgroup charging and consequently more memory.high breaches
   and swapout activity.

   However, the "sleep 10" case's sys time seems to degrade less, and the
   memory.high breaches and swapout activity are similar between the
   before/after (confirming Yosry's hypothesis). Further, the
   memcg_swap_fail activity in the "after" scenario is almost 2X that of
   the "before". This indicates failure to obtain swap offsets, resulting
   in the folio remaining active in memory.

   I tried to better understand this through the 64k mTHP swpout_fallback
   stats in the "sleep 10" zstd experiments:

   --------------------------------------------------------------
                                           "before"       "after"
   --------------------------------------------------------------
   64k mTHP swpout_fallback                 627,308       897,407
   64k folio swapouts                        72,174        67,587
   [p|z]swpout events due to 64k mTHP     1,154,779     1,081,397
   4k folio swapouts                         70,212       154,163
   --------------------------------------------------------------

   The data indicates a higher number of 64k folio swpout_fallback events
   with zswap-mTHP, which correlates with the higher memcg_swap_fail counts
   and 4k folio swapouts with zswap-mTHP. Could the root cause be
   fragmentation of the swap space due to zswap swapout being faster than
   SSD swapout?
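
   The figures quoted in conclusions 2 and 3 can be cross-checked against
   the counters in the experiment tables. The Python sketch below is only
   a sanity check. It assumes 16 4K pages per 64K folio, and it reads the
   breakdown table as "4k folio swapouts = zswpout - zswpout events due to
   64k mTHP" for the "after" column. All variable names are illustrative.

```python
PAGES_PER_64K_FOLIO = 64 // 4  # 16 4K pages per 64K folio


def degradation(sleep0, sleep10, higher_is_better):
    """Percent by which the sleep 10 value is worse than the sleep 0 value."""
    worse_by = (sleep0 - sleep10) if higher_is_better else (sleep10 - sleep0)
    return round(100.0 * worse_by / sleep0)


# Conclusion 2: "before" baseline, (sleep 0, sleep 10) per compressor.
baseline = {
    "zstd": {"tput": (296_684, 86_744), "sys": (92.67, 308.87)},
    "deflate-iaa": {"tput": (274_207, 93_730), "sys": (93.33, 315.29)},
}
tput_drop = {c: degradation(*v["tput"], higher_is_better=True)
             for c, v in baseline.items()}
sys_rise = {c: degradation(*v["sys"], higher_is_better=False)
            for c, v in baseline.items()}

# Conclusion 3: "sleep 10" zstd counters.
swap_fail_ratio = round(18_738_715 / 10_131_859, 2)  # "after" / "before"

zswpout_after = 1_235_560
events_64k_after = 1_081_397      # from the breakdown table
four_k_swapouts_after = zswpout_after - events_64k_after

# Counters are averages of 3 runs, so 16 * folios matches only approximately.
gap_before = abs(PAGES_PER_64K_FOLIO * 72_174 - 1_154_779)
gap_after = abs(PAGES_PER_64K_FOLIO * 67_587 - events_64k_after)

print("throughput drop (%):", tput_drop)
print("sys time rise (%):", sys_rise)
print("memcg_swap_fail after/before:", swap_fail_ratio)
print("4k folio swapouts (after):", four_k_swapouts_after)
print("gap vs 16 * folios:", gap_before, gap_after)
```

   Note that the 71% throughput figure matches the zstd column, while the
   238% sys time figure matches the deflate-iaa column.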

> 
> I am hoping Nhat or Johannes would shed some light on whether they
> usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> Otherwise the testing results from case 1 should be sufficient.
> 
> >
> > In my opinion, even though the test set up does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
> 
> This is only referring to the results of case 2, right?

To begin with, yes. With IAA batching, we can submit, say, up to 8 pages
in an mTHP for parallel compression in hardware. We have also implemented
batching of any-order folios (e.g. mix of 4K/16K/64K/.. folios) reclaimed in the
shrink_folio_list() -> swap_writepage() path, which demonstrates performance
and memory-savings improvements with IAA.
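
As a rough illustration of the batching arithmetic only (this is not the
IAA driver interface, which is not shown in this thread), the Python sketch
below splits the 4K pages of a folio into batches of up to 8 pages. The
batch size of 8 is the "say, up to 8" figure above; the function name and
the page-index representation are hypothetical.

```python
def batch_pages(folio_kb, batch_size=8, page_kb=4):
    """Split a folio's page indices into batches of up to `batch_size`."""
    nr_pages = folio_kb // page_kb
    return [list(range(start, min(start + batch_size, nr_pages)))
            for start in range(0, nr_pages, batch_size)]


# A 64K folio has 16 pages, so it needs two batches of 8.
batches_64k = batch_pages(64)
print(len(batches_64k), [len(batch) for batch in batches_64k])

# A 16K folio has 4 pages and fits in a single, partially filled batch.
batches_16k = batch_pages(16)
print(len(batches_16k), [len(batch) for batch in batches_16k])
```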

> 
> Honestly, I wouldn't want to merge mTHP swapout support on its own
> just because it enables further performance improvements without
> having actual patches for them. But I don't think this captures the
> results accurately as it dismisses case 1 results (which I think are
> more reasonable).

Based on the latest set of data, we do see consistent throughput improvements
with zswap mTHP swapout using zstd and deflate-iaa, as compared to a baseline
where mTHP are swapped to disk (CONFIG_THP_SWAP=y).

The Intel IAA batching related patches would enable the additional performance
improvements I was referring to, only for configurations that have the hardware
acceleration, without impacting performance of software compressors.

Hence, I was thinking we could separate the patch-sets as:

1) zswap-mTHP swapout that could benefit all compressors: this patch series.
2) Additional IAA batching performance improvements that would only
   benefit users of IAA.

I would appreciate your thoughts on this.

Thanks,
Kanchana

> 
> Thanks


* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-20  1:41   ` Sridhar, Kanchana P
@ 2024-09-20  9:29     ` Huang, Ying
  2024-09-20 17:57       ` Sridhar, Kanchana P
  2024-09-20 23:15     ` Yosry Ahmed
  1 sibling, 1 reply; 34+ messages in thread
From: Huang, Ying @ 2024-09-20  9:29 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Yosry Ahmed, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

[snip]

>
> Thanks, these are good points. I ran this experiment with mm-unstable 9-17-2024,
> commit 248ba8004e76eb335d7e6079724c3ee89a011389.
>
> Data is based on average of 3 runs of the vm-scalability "usemem" test.
>
>  4G SSD backing zswap, each process sleeps before exiting
>  ========================================================
>
>  64KB mTHP (cgroup memory.high set to 60G, no swap limit):
>  =========================================================
>  CONFIG_THP_SWAP=Y
>  Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
>  for zswap.
>
>  Experiment 1: Each process sleeps for 0 sec after allocating memory
>  (usemem --init-time -w -O --sleep 0 -n 70 1g):
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
>                                  Baseline                               Baseline
>                                  "before"                 "after"      (sleep 0)
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
>  sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
>  memcg_high            3,503        3,769      44,425      27,154
>  memcg_swap_fail           0            0     115,814     141,936
>  pswpin                   17            0           0           0
>  pswpout             370,853      393,232           0           0
>  zswpin                  693          123         666         667
>  zswpout               1,484          123   1,366,680   1,199,645
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  pgmajfault            3,384        2,951       3,656       3,468
>  ZSWPOUT-64kB            n/a          n/a      82,940      73,121
>  SWPOUT-64kB          23,178       24,577           0           0
>  -------------------------------------------------------------------------------
>
>
>  Experiment 2: Each process sleeps for 10 sec after allocating memory
>  (usemem --init-time -w -O --sleep 10 -n 70 1g):
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
>                                  Baseline                               Baseline
>                                  "before"                 "after"     (sleep 10)
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
>  sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%

What is the elapsed time for all cases?

>  memcg_high          169,450      188,700     143,691     177,887
>  memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
>  pswpin                   17           16           0           0
>  pswpout           1,154,779    1,210,485           0           0
>  zswpin                  711          659       1,016         736
>  zswpout              70,212       50,128   1,235,560   1,275,917
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  pgmajfault            6,120        6,291       8,789       6,474
>  ZSWPOUT-64kB            n/a          n/a      67,587      68,912
>  SWPOUT-64kB          72,174       75,655           0           0
>  -------------------------------------------------------------------------------
>
>
> Conclusions from the experiments:
> =================================
> 1) zswap-mTHP improves throughput as compared to the baseline, for zstd and
>    deflate-iaa.
>
> 2) Yosry's theory is proved correct in the 4G constrained swap setup.
>    When the processes are constrained to sleep 10 sec after allocating
>    memory, thereby keeping the memory allocated longer, the "Baseline" or
>    "before" with mTHP getting stored in SSD shows a degradation of 71% in
>    throughput and 238% in sys time, as compared to the "Baseline" with

Could the higher sys time come from compressing on the CPU, as opposed to
writing to disk?

>    sleep 0 that benefits from serialization of disk IO not allowing all
>    processes to allocate memory at the same time.
>
> 3) In the 4G SSD "sleep 0" case, zswap-mTHP shows an increase in sys time
>    due to the cgroup charging and consequently more memory.high breaches
>    and swapout activity.
>
>    However, the "sleep 10" case's sys time seems to degrade less, and the
>    memory.high breaches and swapout activity are similar between the
>    before/after (confirming Yosry's hypothesis). Further, the
>    memcg_swap_fail activity in the "after" scenario is almost 2X that of
>    the "before". This indicates failure to obtain swap offsets, resulting
>    in the folio remaining active in memory.
>
>    I tried to better understand this through the 64k mTHP swpout_fallback
>    stats in the "sleep 10" zstd experiments:
>
>    --------------------------------------------------------------
>                                            "before"       "after"
>    --------------------------------------------------------------
>    64k mTHP swpout_fallback                 627,308       897,407
>    64k folio swapouts                        72,174        67,587
>    [p|z]swpout events due to 64k mTHP     1,154,779     1,081,397
>    4k folio swapouts                         70,212       154,163
>    --------------------------------------------------------------
>
>    The data indicates a higher # of 64k folio swpout_fallback with
>    zswap-mTHP, which correlates with the higher memcg_swap_fail counts and
>    4k folio swapouts with zswap-mTHP. Could the root-cause be fragmentation
>    of the swap space due to zswap swapout being faster than SSD swapout?
>

[snip]

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-20  9:29     ` Huang, Ying
@ 2024-09-20 17:57       ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 17:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, September 20, 2024 2:29 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> [snip]
> 
> >
> > Thanks, these are good points. I ran this experiment with mm-unstable 9-17-2024,
> > commit 248ba8004e76eb335d7e6079724c3ee89a011389.
> >
> > Data is based on average of 3 runs of the vm-scalability "usemem" test.
> >
> >  4G SSD backing zswap, each process sleeps before exiting
> >  ========================================================
> >
> >  64KB mTHP (cgroup memory.high set to 60G, no swap limit):
> >  =========================================================
> >  CONFIG_THP_SWAP=Y
> >  Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
> >  for zswap.
> >
> >  Experiment 1: Each process sleeps for 0 sec after allocating memory
> >  (usemem --init-time -w -O --sleep 0 -n 70 1g):
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
> >                                  Baseline                               Baseline
> >                                  "before"                 "after"      (sleep 0)
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
> >  sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
> >  memcg_high            3,503        3,769      44,425      27,154
> >  memcg_swap_fail           0            0     115,814     141,936
> >  pswpin                   17            0           0           0
> >  pswpout             370,853      393,232           0           0
> >  zswpin                  693          123         666         667
> >  zswpout               1,484          123   1,366,680   1,199,645
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            3,384        2,951       3,656       3,468
> >  ZSWPOUT-64kB            n/a          n/a      82,940      73,121
> >  SWPOUT-64kB          23,178       24,577           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  Experiment 2: Each process sleeps for 10 sec after allocating memory
> >  (usemem --init-time -w -O --sleep 10 -n 70 1g):
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
> >                                  Baseline                               Baseline
> >                                  "before"                 "after"     (sleep 10)
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
> >  sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%
> 
> What is the elapsed time for all cases?

Sure, listed below is the data for both experiments with elapsed time in row 2:

 4G SSD backing zswap, each process sleeps before exiting
 ========================================================

 64KB mTHP (cgroup memory.high set to 60G, no swap limit):
 =========================================================
 CONFIG_THP_SWAP=Y
 Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
 for zswap.

 Experiment 1: Each process sleeps for 0 sec after allocating memory
 (usemem --init-time -w -O --sleep 0 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"      (sleep 0)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   296,684      274,207     359,722     390,162    21%     42%
 elapsed time (sec)     4.91         4.80        4.42        5.08    10%     -6%
 sys time (sec)        92.67        93.33      251.06      237.56  -171%   -155%
 memcg_high            3,503        3,769      44,425      27,154
 memcg_swap_fail           0            0     115,814     141,936
 pswpin                   17            0           0           0
 pswpout             370,853      393,232           0           0
 zswpin                  693          123         666         667
 zswpout               1,484          123   1,366,680   1,199,645
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            3,384        2,951       3,656       3,468
 ZSWPOUT-64kB            n/a          n/a      82,940      73,121
 SWPOUT-64kB          23,178       24,577           0           0
 -------------------------------------------------------------------------------


 Experiment 2: Each process sleeps for 10 sec after allocating memory
 (usemem --init-time -w -O --sleep 10 -n 70 1g):

 -------------------------------------------------------------------------------
                    mm-unstable 9-17-2024           zswap-mTHP v6     Change wrt
                                 Baseline                               Baseline
                                 "before"                 "after"     (sleep 10)
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    86,744       93,730     157,528     113,110    82%     21%
 elapsed time (sec)    30.24        31.73       33.39       32.50   -10%     -2%
 sys time (sec)       308.87       315.29      477.55      629.98   -55%   -100%
 memcg_high          169,450      188,700     143,691     177,887
 memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
 pswpin                   17           16           0           0
 pswpout           1,154,779    1,210,485           0           0
 zswpin                  711          659       1,016         736
 zswpout              70,212       50,128   1,235,560   1,275,917
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 pgmajfault            6,120        6,291       8,789       6,474
 ZSWPOUT-64kB            n/a          n/a      67,587      68,912
 SWPOUT-64kB          72,174       75,655           0           0
 -------------------------------------------------------------------------------


> 
> >  memcg_high          169,450      188,700     143,691     177,887
> >  memcg_swap_fail  10,131,859    9,740,646  18,738,715  19,528,110
> >  pswpin                   17           16           0           0
> >  pswpout           1,154,779    1,210,485           0           0
> >  zswpin                  711          659       1,016         736
> >  zswpout              70,212       50,128   1,235,560   1,275,917
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            6,120        6,291       8,789       6,474
> >  ZSWPOUT-64kB            n/a          n/a      67,587      68,912
> >  SWPOUT-64kB          72,174       75,655           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> > Conclusions from the experiments:
> > =================================
> > 1) zswap-mTHP improves throughput as compared to the baseline, for zstd and
> >    deflate-iaa.
> >
> > 2) Yosry's theory is proved correct in the 4G constrained swap setup.
> >    When the processes are constrained to sleep 10 sec after allocating
> >    memory, thereby keeping the memory allocated longer, the "Baseline" or
> >    "before" with mTHP getting stored in SSD shows a degradation of 71% in
> >    throughput and 238% in sys time, as compared to the "Baseline" with
> 
> Higher sys time may come from compression with CPU vs. disk writing?
> 

Here, I was comparing the "before" sys times between "sleep 10" and
"sleep 0" experiments where mTHP get stored to SSD. I was trying to
understand the increase in "before" sys time in "sleep 10", and my
analysis was this could be due to the following cycle of events:

  memory remaining allocated longer, any reclaimed memory per process
  is mostly cold memory and is not paged back in (17 pswpin for zstd),
  swap slots are not released,
  swap slot allocation failures,
  folios in the reclaim list returned to being active,
  more swapout activity in "before"/"sleep 10" (1,224,991 zstd) as
   compared to "before"/"sleep 0" (372,337 zstd),
  more sys time in "before"/"sleep 10" as compared to "before"/"sleep 0".

IOW, my takeaway from only the "before" experiments with sleep 10
vs. sleep 0 was the higher swapout activity resulting in increased
sys time.
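
For reference, the swapout totals in the list above are the sum of
pswpout and zswpout in the zstd "before" columns. A minimal Python sketch
of that arithmetic, assuming both counters are in units of 4K pages (the
dictionary layout and function name are only for illustration):

```python
# Counters copied from the zstd "before" columns of Experiments 1 and 2.
before_zstd = {
    "sleep 0": {"pswpout": 370_853, "zswpout": 1_484},
    "sleep 10": {"pswpout": 1_154_779, "zswpout": 70_212},
}


def total_swapouts(counters):
    """Total pages swapped out: SSD writes (pswpout) plus zswap stores (zswpout)."""
    return counters["pswpout"] + counters["zswpout"]


totals = {name: total_swapouts(c) for name, c in before_zstd.items()}
for name, total in totals.items():
    print(f'"before"/"{name}": {total:,} pages swapped out')
```

This gives 372,337 for "sleep 0" and 1,224,991 for "sleep 10", i.e. roughly
3.3X more swapout activity when the processes hold on to their memory.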

The zswap-mTHP "after" experiments don't show significantly higher
successful swapout activity between "sleep 10" vs. "sleep 0". This is
not to say that the above cycle of events does not occur here as well,
as indicated by the higher memcg_swap_fail counts, signifying
attempted swapouts.

However, the zswap-mTHP "after" sys time increase going from
"sleep 0" to "sleep 10" is not as bad as that for "before":


   "before" = 4G SSD mTHP
   "after" = zswap-mTHP

 -------------------------------------------------------------------------
                           mm-unstable 9-17-2024             zswap-mTHP v6 
                                        Baseline
                                        "before"                   "after" 
 -------------------------------------------------------------------------
 ZSWAP compressor              zstd  deflate-iaa       zstd    deflate-iaa
 -------------------------------------------------------------------------
 "sleep 0"  sys time (sec)    92.67        93.33     251.06         237.56
 "sleep 10" sys time (sec)   308.87       315.29     477.55         629.98
 -------------------------------------------------------------------------
 "sleep 10" sys time          -233%        -238%       -90%          -165%
  vs. "sleep 0"
 -------------------------------------------------------------------------
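
The last row of the table can be reproduced with the sketch below. It
assumes the same sign convention as the "Change wrt Baseline" columns,
where an increase in sys time is reported as a negative change; the helper
name is made up for this example:

```python
# sys time (sec) from the table above, keyed by (kernel, ZSWAP compressor).
sys_time = {
    ("before", "zstd"): {"sleep 0": 92.67, "sleep 10": 308.87},
    ("before", "deflate-iaa"): {"sleep 0": 93.33, "sleep 10": 315.29},
    ("after", "zstd"): {"sleep 0": 251.06, "sleep 10": 477.55},
    ("after", "deflate-iaa"): {"sleep 0": 237.56, "sleep 10": 629.98},
}


def change_wrt_baseline(baseline, value):
    """Percent change; negative means the value grew (i.e. sys time got worse)."""
    return round((baseline - value) / baseline * 100)


changes = {
    key: change_wrt_baseline(t["sleep 0"], t["sleep 10"])
    for key, t in sys_time.items()
}
for (kernel, compressor), pct in changes.items():
    print(f'"{kernel}" {compressor}: {pct}%')
```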


> >    sleep 0 that benefits from serialization of disk IO not allowing all
> >    processes to allocate memory at the same time.
> >
> > 3) In the 4G SSD "sleep 0" case, zswap-mTHP shows an increase in sys time
> >    due to the cgroup charging and consequently higher memcg.high breaches
> >    and swapout activity.
> >
> >    However, the "sleep 10" case's sys time seems to degrade less, and the
> >    memcg.high breaches and swapout activity are almost similar between the
> >    before/after (confirming Yosry's hypothesis). Further, the
> >    memcg_swap_fail activity in the "after" scenario is almost 2X that of
> >    the "before". This indicates failure to obtain swap offsets, resulting
> >    in the folio remaining active in memory.
> >
> >    I tried to better understand this through the 64k mTHP swpout_fallback
> >    stats in the "sleep 10" zstd experiments:
> >
> >    --------------------------------------------------------------
> >                                            "before"       "after"
> >    --------------------------------------------------------------
> >    64k mTHP swpout_fallback                 627,308       897,407
> >    64k folio swapouts                        72,174        67,587
> >    [p|z]swpout events due to 64k mTHP     1,154,779     1,081,397
> >    4k folio swapouts                         70,212       154,163
> >    --------------------------------------------------------------
> >
> >    The data indicates a higher # of 64k folio swpout_fallback with
> >    zswap-mTHP, which correlates with the higher memcg_swap_fail counts and
> >    4k folio swapouts with zswap-mTHP. Could the root-cause be fragmentation
> >    of the swap space due to zswap swapout being faster than SSD swapout?
> >
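
A rough sketch of how the rows in the 64k table relate to each other,
assuming a 64k folio accounts for 16 4K pages in the [p|z]swpout counters
and that the remaining zswpout events in the "after" run are 4k folios.
The names are illustrative; the small gap between 16 x folios and the
reported events may simply come from averaging over 3 runs:

```python
PAGES_PER_64K_FOLIO = 64 // 4  # 16 base pages of 4K each

# "sleep 10" zstd counters from the tables in this thread.
before = {"folios_64k": 72_174, "events_from_64k": 1_154_779, "zswpout": 70_212}
after = {"folios_64k": 67_587, "events_from_64k": 1_081_397, "zswpout": 1_235_560}


def expected_events(folios_64k):
    """Page-level swapout events expected from whole 64k folios."""
    return folios_64k * PAGES_PER_64K_FOLIO


# "after": zswap stores both 64k folios and 4k folios, so the 4k share is
# whatever is left once the 64k-derived events are subtracted.
after_4k_swapouts = after["zswpout"] - after["events_from_64k"]

# "before": 64k folios go to SSD, so zswpout is (assumed to be) all 4k folios.
before_4k_swapouts = before["zswpout"]

for name, run in (("before", before), ("after", after)):
    gap = expected_events(run["folios_64k"]) - run["events_from_64k"]
    print(f'"{name}": 16 x 64k folios differs from reported events by {gap}')
print(f"4k folio swapouts: before={before_4k_swapouts:,} after={after_4k_swapouts:,}")
```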
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-20  1:41   ` Sridhar, Kanchana P
  2024-09-20  9:29     ` Huang, Ying
@ 2024-09-20 23:15     ` Yosry Ahmed
  2024-09-20 23:45       ` Sridhar, Kanchana P
  1 sibling, 1 reply; 34+ messages in thread
From: Yosry Ahmed @ 2024-09-20 23:15 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh

[..]
> > If we really want to compare CONFIG_THP_SWAP on before and after, it
> > should be with SSD because that's a more conventional setup. In this
> > case the users that have CONFIG_THP_SWAP=y only experience the
> > benefits of zswap with this series. You mentioned experimenting with
> > usemem to keep the memory allocated longer so that you're able to have
> > a fair test with the small SSD swap setup. Did that work?
>
> Thanks, these are good points. I ran this experiment with mm-unstable 9-17-2024,
> commit 248ba8004e76eb335d7e6079724c3ee89a011389.
>
> Data is based on average of 3 runs of the vm-scalability "usemem" test.

Thanks for the results, this makes much more sense. I see you also ran
the tests with a larger swap size, which is good. In the next
iteration, I would honestly drop the results with --sleep 0 because
it's not a fair comparison imo.

I see that in most cases we are observing higher sys time with zswap,
and sometimes even higher elapsed time, which is concerning. If the
sys time is higher when comparing zswap to SSD, but elapsed time is
not higher, this can be normal due to compression on the CPU vs.
asynchronous disk writes.

However, if the sys time increases when comparing CONFIG_THP_SWAP=n
before this series and CONFIG_THP_SWAP=y with this series (i.e.
comparing zswap with 4K vs. zswap with mTHP), then that's a problem.

Also, if the total elapsed time increases, it is also a problem.
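
To make the criteria above concrete, here is one possible way to write
them down as a check. This is only a sketch of my reading of the rule; the
comparison labels and the function name are hypothetical, and the sample
numbers are the zstd sys/elapsed times from the 4G SSD tables:

```python
def assess(comparison, sys_before, sys_after, elapsed_before, elapsed_after):
    """Classify a before/after result.

    comparison is "zswap_vs_ssd" or "zswap_4k_vs_mthp" (hypothetical labels).
    """
    if elapsed_after > elapsed_before:
        return "problem: elapsed time increased"
    if sys_after > sys_before:
        if comparison == "zswap_vs_ssd":
            return "can be normal: CPU compression vs. asynchronous disk writes"
        return "problem: sys time increased"
    return "ok"


# zstd, "sleep 0": sys time is higher but elapsed time is lower.
sleep0 = assess("zswap_vs_ssd", 92.67, 251.06, 4.91, 4.42)
# zstd, "sleep 10": both sys time and elapsed time are higher.
sleep10 = assess("zswap_vs_ssd", 308.87, 477.55, 30.24, 33.39)
print(sleep0)
print(sleep10)
```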

My main concern is that synchronous compression of an mTHP may be too
expensive of an operation to do in one shot. I am wondering if we need
to implement asynchronous swapout for zswap, so that it behaves more
like swapping to disk from a reclaim perspective.

Anyway, there are too many test results now. For the next version, I
would suggest only having two different test cases:
1. Comparing zswap 4K vs zswap mTHP. This would be done by comparing
CONFIG_THP_SWAP=n to CONFIG_THP_SWAP=y as you did before.

2. Comparing SSD swap mTHP vs zswap mTHP.

In both cases, I think we want to use a sufficiently large swapfile
and make the usemem processes sleep for a while to maintain the memory
allocations. Since we already confirmed the theory about the
restricted swapfile results being due to processes immediately
exiting, I don't see value in running tests anymore with a restricted
swapfile or without sleeping.

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-20 23:15     ` Yosry Ahmed
@ 2024-09-20 23:45       ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 23:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Zou, Nanhai,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 20, 2024 4:16 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> [..]
> > > If we really want to compare CONFIG_THP_SWAP on before and after, it
> > > should be with SSD because that's a more conventional setup. In this
> > > case the users that have CONFIG_THP_SWAP=y only experience the
> > > benefits of zswap with this series. You mentioned experimenting with
> > > usemem to keep the memory allocated longer so that you're able to have
> > > a fair test with the small SSD swap setup. Did that work?
> >
> > > Thanks, these are good points. I ran this experiment with mm-unstable 9-17-2024,
> > > commit 248ba8004e76eb335d7e6079724c3ee89a011389.
> >
> > Data is based on average of 3 runs of the vm-scalability "usemem" test.
> 
> Thanks for the results, this makes much more sense. I see you also ran
> the tests with a larger swap size, which is good. In the next
> iteration, I would honestly drop the results with --sleep 0 because
> it's not a fair comparison imo.

Thanks for the comments, Yosry. Sure, this sounds good.

> 
> I see that in most cases we are observing higher sys time with zswap,
> and sometimes even higher elapsed time, which is concerning. If the
> sys time is higher when comparing zswap to SSD, but elapsed time is
> not higher, this can be normal due to compression on the CPU vs.
> asynchronous disk writes.
> 
> However, if the sys time increases when comparing CONFIG_THP_SWAP=n
> before this series and CONFIG_THP_SWAP=y with this series (i.e.
> comparing zswap with 4K vs. zswap with mTHP), then that's a problem.
> 
> Also, if the total elapsed time increases, it is also a problem.

Agreed. So far in the "Case 1" data published in v6, that compares zswap 4k
(CONFIG_THP_SWAP=n) vs. zswap mTHP (CONFIG_THP_SWAP=y), we see
consistent reduction in sys time with this patch-series. I will confirm by
re-gathering data with v7 (will post elapsed and sys times).

> 
> My main concern is that synchronous compression of an mTHP may be too
> expensive of an operation to do in one shot. I am wondering if we need
> to implement asynchronous swapout for zswap, so that it behaves more
> like swapping to disk from a reclaim perspective.
> 
> Anyway, there are too many test results now. For the next version, I
> would suggest only having two different test cases:
> 1. Comparing zswap 4K vs zswap mTHP. This would be done by comparing
> CONFIG_THP_SWAP=n to CONFIG_THP_SWAP=y as you did before.
> 
> 2. Comparing SSD swap mTHP vs zswap mTHP.
> 
> In both cases, I think we want to use a sufficiently large swapfile
> and make the usemem processes sleep for a while to maintain the memory
> allocations. Since we already confirmed the theory about the
> restricted swapfile results being due to processes immediately
> exiting, I don't see value in running tests anymore with a restricted
> swapfile or without sleeping.

Ok, this sounds good. I will submit a v7 with all these suggestions incorporated.

Thanks,
Kanchana

> 
> Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-08-29 21:27 [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-08-29 22:48 ` [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
@ 2024-09-02 14:40 ` Usama Arif
  2024-09-20 19:31   ` Sridhar, Kanchana P
  4 siblings, 1 reply; 34+ messages in thread
From: Usama Arif @ 2024-09-02 14:40 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, chengming.zhou, ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal



On 29/08/2024 17:27, Kanchana P Sridhar wrote:
> Hi All,
> 
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the 
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
> 
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> 
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
> 
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
> 
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> 
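
For anyone wanting to collect these counters in a test script, a minimal
sketch follows. It assumes a kernel that exposes the per-order stats files
at the path quoted above; the function name and the `root` parameter are
made up for this example, and on a kernel without the files it simply
returns an empty dict:

```python
import glob
import os
import re


def read_mthp_stat(stat="zswpout", root="/sys/kernel/mm/transparent_hugepage"):
    """Return {folio size in kB: counter value} for one per-order mTHP stat."""
    counts = {}
    pattern = os.path.join(root, "hugepages-*kB", "stats", stat)
    for path in glob.glob(pattern):
        # The folio size is encoded in the directory name, e.g. hugepages-64kB.
        match = re.search(r"hugepages-(\d+)kB", path)
        with open(path) as f:
            counts[int(match.group(1))] = int(f.read().strip())
    return counts


for size_kb, count in sorted(read_mthp_stat().items()):
    print(f"{size_kb}kB zswpout: {count}")
```
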
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. When disabled, zswap will
> fallback to rejecting the mTHP folio, to be processed by the backing
> swap device.
> 
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
> 
Hi Kanchana,

If I am repeating any of the questions raised in previous revisions
over here, please feel free to just point to earlier responses!

Just wanted to check what compress batching of mTHP swap-out means.
Does it mean that zswap will not compress the mTHP page by page, but will compress the entire mTHP?
If it improves performance and possibly the numbers for case 2 below, maybe it's worth
adding it to this series?

> Thanks to Ying Huang for pre-posting review feedback and suggestions!
> 
> Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
> reviews and suggestions!
> 
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 8/29/2024,
>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>    suggestion to add a knob by which users can enable/disable this
>    change. Nhat, I hope this is along the lines of what you were
>    thinking.
> 3) Added vm-scalability usemem data with 4K folios with
>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>    there is no regression with this change.
> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>    before/after, as suggested by Yosry, so we can understand the impact
>    of when mTHPs are split into 4K folios in shrink_folio_list()
>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>    in zswap. Thanks Yosry for this suggestion.
> 
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>    robot; as per Nhat's and Michal's suggestion to not require a separate
>    patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>    commit. Thanks Yosry for this suggestion!
> 
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
> 
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
> 
> Changes since RFC v1:
> =====================
> 
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
> 
> 
> Regression Testing:
> ===================
> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> folios with mm-unstable and with this patch-series. The main goal was
> to make sure that there is no functional or performance regression
> wrt the earlier zswap behavior for 4K folios,
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of 4K
> pages goes through the newly added code path [zswap_store(),
> zswap_store_page()].
> 
> The data indicates there is no regression.
> 
>  ------------------------------------------------------------------------------
>                      mm-unstable 8-28-2024                        zswap-mTHP v6
>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>                                                                      is not set
>  ------------------------------------------------------------------------------
>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
>                                        iaa                                  iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    110,775      113,010               111,550        121,937
>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
>  memcg_high           140,500      153,737               139,772        134,129
>  memcg_swap_high            0            0                     0              0
>  memcg_swap_fail            0            0                     0              0
>  pswpin                     0            0                     0              0
>  pswpout                    0            0                     0              0
>  zswpin                   675          690                   682            684
>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
>  thp_swpout                 0            0                     0              0
>  thp_swpout_                0            0                     0              0
>   fallback                                                                     
>  pgmajfault             3,453        3,468                 3,841          3,487
>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
>  SWPOUT-64kB-mTHP           0            0                     0              0
>  ------------------------------------------------------------------------------
>                                                  
> 
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
> 
> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> Core frequency was fixed at 2500MHz.
> 
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
> 
>     usemem --init-time -w -O -n 70 1g
> 
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
> 
> Other kernel configuration parameters:
> 
>     ZSWAP Compressors : zstd, deflate-iaa
>     ZSWAP Allocator   : zsmalloc
>     SWAP page-cluster : 2
> 
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
> 
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
> 
> Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
> ===============================================================================
> 
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> 64K/2M (m)THP to be split, and only 4K folios processed by zswap.
> 
> The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> in 64K/2M (m)THP to not be split, and processed by zswap.
> 
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
> 
>  -------------------------------------------------------------------------------
>                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
>                                  Baseline                               Baseline
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
>  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
>  memcg_high          124,183      127,513     138,651     133,884
>  memcg_swap_high           0            0           0           0
>  memcg_swap_fail     619,020      751,099           0           0
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  656          569         624         639
>  zswpout           9,413,603   11,284,812   9,453,761   9,385,910

I would expect zswpout to either remain the same or slightly increase when using
CONFIG_THP_SWAP. But for deflate-iaa, there is a 17% decrease in zswpout, which
doesn't make sense?
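
The 17% figure can be reproduced from the 64KB table above. The snippet below is only an arithmetic sketch over the tabulated zswpout values; the helper name pct_change is hypothetical and not part of the series.

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# zswpout values copied from the 64KB mTHP table (Case 1).
zswpout_zstd_before = 9_413_603   # CONFIG_THP_SWAP=N, zstd
zswpout_zstd_after = 9_453_761    # CONFIG_THP_SWAP=Y, zstd
zswpout_iaa_before = 11_284_812   # CONFIG_THP_SWAP=N, deflate-iaa
zswpout_iaa_after = 9_385_910     # CONFIG_THP_SWAP=Y, deflate-iaa

zstd_change = pct_change(zswpout_zstd_before, zswpout_zstd_after)
iaa_change = pct_change(zswpout_iaa_before, zswpout_iaa_after)

print(f"zstd:        {zstd_change:+.1f}%")
print(f"deflate-iaa: {iaa_change:+.1f}%")
```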

>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  pgmajfault            3,470        3,382       4,633       3,611
>  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
>  SWPOUT-64kB               0            0           0           0
>  -------------------------------------------------------------------------------
> 
> 
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
> 
>  ------------------------------------------------------------------------------
>                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
>                                  Baseline                              Baseline
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>                                      iaa                     iaa            iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
>  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
>  memcg_high            14,628     16,247       14,951      16,096
>  memcg_swap_high            0          0            0           0
>  memcg_swap_fail       18,698     21,114            0           0
>  pswpin                     0          0            0           0
>  pswpout                    0          0            0           0
>  zswpin                   663        665        5,333         781
>  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
>  thp_swpout                 0          0            0           0
>  thp_swpout_           18,697     21,113            0           0
>   fallback
>  pgmajfault             3,439      3,496        8,139       3,582
>  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
>  SWPOUT-2048kB              0          0            0           0
>  -----------------------------------------------------------------------------
> 
> We see improvements overall in throughput and sys time for zstd and
> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> 
> 
> Case 2: Baseline with CONFIG_THP_SWAP enabled.
> ==============================================
> 
> In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
> being stored by the backing swap device.
> 


Just curious, how did you make the before case of zswap rejecting mTHP work?

> The "after" represents data with this patch-series, that results in 64K/2M
> (m)THP being processed by zswap.
> 
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
> 
>  ------------------------------------------------------------------------------
>                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
>                                Baseline                                Baseline
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
>                                     iaa                     iaa             iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
>  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
>  memcg_high          111,223    110,889     138,651     133,884
>  memcg_swap_high           0          0           0           0
>  memcg_swap_fail           0          0           0           0
>  pswpin                   16         16           0           0
>  pswpout           7,471,472  7,527,963           0           0
>  zswpin                  635        605         624         639
>  zswpout               1,509      1,478   9,453,761   9,385,910
>  thp_swpout                0          0           0           0
>  thp_swpout_               0          0           0           0
>   fallback
>  pgmajfault            3,616      3,430       4,633       3,611
>  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
>  SWPOUT-64kB         466,967    470,498           0           0
>  ------------------------------------------------------------------------------
> 
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
> 
>  ------------------------------------------------------------------------------
>                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
>                                 Baseline                               Baseline
>  ------------------------------------------------------------------------------
>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>                                      iaa                     iaa            iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
>  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
>  memcg_high            16,054     15,936      14,951      16,096
>  memcg_swap_high            0          0           0           0
>  memcg_swap_fail            0          0           0           0
>  pswpin                     0          0           0           0
>  pswpout            8,629,248  8,628,907           0           0
>  zswpin                   560        645       5,333         781
>  zswpout                1,416      1,503   8,546,895   9,355,760
>  thp_swpout            16,854     16,853           0           0
>  thp_swpout_                0          0           0           0
>   fallback
>  pgmajfault             3,341      3,574       8,139       3,582
>  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
>  SWPOUT-2048kB         16,854     16,853           0           0
>  ------------------------------------------------------------------------------
> 
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
> 

Hmm, if that were the case, wouldn't the "after" zswpout be much higher than the
"before" pswpout? But they look very similar (it even goes down for zstd).

If pswpout in "before" is approximately equal to zswpout in "after", then doesn't
it mean that swap is performing better than zswap? That probably shouldn't happen.
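
For reference, the comparison can be put in numbers. The sketch below only restates the Case 2 tables: it takes the "before" pswpout and the "after" zswpout for each column and prints the relative difference. The dictionary layout and helper name are illustrative assumptions.

```python
def relative_diff_pct(before: int, after: int) -> float:
    """Difference of `after` relative to `before`, in percent."""
    return (after - before) / before * 100.0


# (before pswpout, after zswpout), copied from the Case 2 tables.
counters = {
    "64K zstd": (7_471_472, 9_453_761),
    "64K deflate-iaa": (7_527_963, 9_385_910),
    "2M zstd": (8_629_248, 8_546_895),
    "2M deflate-iaa": (8_628_907, 9_355_760),
}

diffs = {
    name: relative_diff_pct(before, after)
    for name, (before, after) in counters.items()
}

for name, diff in diffs.items():
    print(f"{name:16s} after zswpout vs. before pswpout: {diff:+.1f}%")
```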

Thanks,
Usama

> This causes degradation in throughput and sys time with zswap mTHP, more so
> in case of zstd than deflate-iaa. Compress latency could play a part in
> this - when there is more swapout activity happening, a slower compressor
> would cause allocations to stall for any/all of the 70 processes.
> 
> In my opinion, even though the test set up does not provide an accurate
> way for a direct before/after comparison (because of zswap usage being
> counted in cgroup, hence towards the memory.high), it still seems
> reasonable for zswap_store to support (m)THP, so that further performance
> improvements can be implemented.
> 
> One of the ideas that has shown promise in our experiments is to improve
> ZSWAP mTHP store performance using batching. With IAA compress/decompress
> batching used in ZSWAP, we are able to demonstrate significant
> performance improvements and memory savings with IAA in scalability
> experiments, as compared to software compressors. We hope to submit
> this work as subsequent RFCs.
> 
> I would greatly appreciate your code review comments and suggestions!
> 
> Thanks,
> Kanchana
> 
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> 
> 
> Kanchana P Sridhar (3):
>   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
>   mm: zswap: zswap_store() extended to handle mTHP folios.
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
>     stats.
> 
>  include/linux/huge_mm.h    |   1 +
>  include/linux/memcontrol.h |   4 +
>  mm/Kconfig                 |   8 ++
>  mm/huge_memory.c           |   3 +
>  mm/page_io.c               |   3 +-
>  mm/zswap.c                 | 243 +++++++++++++++++++++++++++----------
>  6 files changed, 200 insertions(+), 62 deletions(-)
> 
> 
> base-commit: 9287e4adbc6ab8fa04d25eb82e097fed877a4642




* RE: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
  2024-09-02 14:40 ` Usama Arif
@ 2024-09-20 19:31   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 34+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 19:31 UTC (permalink / raw)
  To: Usama Arif, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com,
	chengming.zhou@linux.dev, ryan.roberts@arm.com, Huang, Ying,
	21cnbao@gmail.com, akpm@linux-foundation.org, Sridhar, Kanchana P
  Cc: Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

Hi Usama,

> -----Original Message-----
> From: Usama Arif <usamaarif642@gmail.com>
> Sent: Monday, September 2, 2024 7:41 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org
> Cc: Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
> 
> 
> 
> On 29/08/2024 17:27, Kanchana P Sridhar wrote:
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/3 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. When disabled, zswap will
> > fallback to rejecting the mTHP folio, to be processed by the backing
> > swap device.
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> Hi Kanchana,
> 
> If I am repeating any of the questions raised in previous revisions
> over here, please feel free to just point to earlier responses!

Sure, no problem. Thanks for the questions and observations with regard
to the data posted in v6!

> 
> Just wanted to check: what does compress batching of mTHP swap-out mean?
> Does it mean that zswap will not compress the mTHP page by page, but will
> compress the entire mTHP?
> If it improves performance, and possibly the numbers for case 2 below, maybe
> it's worth adding it to this series?

With Intel IAA, we have the opportunity to make use of compression
and decompression engines in hardware to do parallel compressions during
swapout and parallel decompressions during swapin with readahead.
If compressions can be parallelized, we can improve reclaim performance.
If decompressions can be parallelized, we can improve page-fault handling
performance.

We have implemented compress batching within mTHP folios during
zswap store, as well as compress batching of any-order folios during
shrink_folio_list() -- swap_writepage() using a plug mechanism, similar
to the existing swap_write_unplug() implementation.

Initially, our solution works at the granularity of compressing PAGE_SIZE
pages within (many) folios in parallel, to maximize throughput with IAA
and minimize latency per folio store/load. This is the compress/decompress
batching I was referring to. To utilize IAA compress/decompress engines,
we have developed the respective batching interfaces from
shrink_folio_list() and from swapin_readahead(). Our experiments
in multi-instance, highly contended scenarios under memory pressure,
have demonstrated significant kernel and workload level performance
improvements and overall system level memory savings. As noted in my
response to Yosry, I was intending to submit this functionality as a
patch-series separate from the basic "mm: zswap: support mTHP swapout in
zswap_store()" (this patch-series). As long as we can demonstrate that
zswap-mTHP swapout is beneficial in and of itself, I believe the IAA batching
improvements can follow as a separate patch-series.
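
To illustrate the idea of compressing the PAGE_SIZE pages of a folio in parallel, here is a user-space analogy only. It is not the kernel or iaa_crypto interface: zlib and a thread pool stand in for the hardware compress/decompress engines, and the function names, the 4 KiB page size and the worker count are assumptions made for the sketch.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List

PAGE_SIZE = 4096  # assumed base page size


def split_into_pages(folio: bytes) -> List[bytes]:
    """Split a folio-sized buffer into PAGE_SIZE chunks."""
    return [folio[off:off + PAGE_SIZE] for off in range(0, len(folio), PAGE_SIZE)]


def compress_folio_batched(folio: bytes, workers: int = 4) -> List[bytes]:
    """Compress each page independently, several pages at a time."""
    pages = split_into_pages(folio)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves page order, so blob i belongs to page i.
        return list(pool.map(zlib.compress, pages))


def decompress_folio_batched(blobs: List[bytes], workers: int = 4) -> bytes:
    """Decompress the per-page blobs in parallel and reassemble the folio."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blobs))


# A 64 KiB buffer standing in for a 64K mTHP folio.
folio = bytes(range(256)) * 256
blobs = compress_folio_batched(folio)
restored = decompress_folio_batched(blobs)
print(len(blobs), "pages compressed;", "round trip ok:", restored == folio)
```

Each page is still compressed on its own, so a per-page store layout is unchanged; only the scheduling of the compressions differs.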

We are also staying tuned in to Barry Song's mTHP swapin efforts
to eventually be able to swapout/swapin an mTHP as a single entity.
In this case also, IAA byN can compress/decompress a tunable number
of chunks of an mTHP in parallel [1].

The IAA byN approach is dependent on Barry's patchsets for mTHP
swapin [2] and associated zsmalloc updates for storing larger compressed
buffers [3]. Please note that Barry's work is focused on ZRAM/sync IO mTHP
swapin and not for ZSWAP.

[1] https://lore.kernel.org/all/8fe04e86f0907588d210885ac91965960f97f450.1714581792.git.andre.glover@linux.intel.com/T/#u
[2] https://patchwork.kernel.org/project/linux-mm/cover/20240908232119.2157-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/

> 
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
> > reviews and suggestions!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> > folios with mm-unstable and with this patch-series. The main goal was
> > to make sure that there is no functional or performance regression
> > wrt the earlier zswap behavior for 4K folios, when
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of 4K
> > pages goes through the newly added code path [zswap_store(),
> > zswap_store_page()].
> >
> > The data indicates there is no regression.
> >
> >  ------------------------------------------------------------------------------
> >                      mm-unstable 8-28-2024                        zswap-mTHP v6
> >                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                                      is not set
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
> >                                        iaa                                  iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    110,775      113,010               111,550        121,937
> >  sys time (sec)      1,141.72       954.87              1,131.95         828.47
> >  memcg_high           140,500      153,737               139,772        134,129
> >  memcg_swap_high            0            0                     0              0
> >  memcg_swap_fail            0            0                     0              0
> >  pswpin                     0            0                     0              0
> >  pswpout                    0            0                     0              0
> >  zswpin                   675          690                   682            684
> >  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
> >  thp_swpout                 0            0                     0              0
> >  thp_swpout_                0            0                     0              0
> >   fallback
> >  pgmajfault             3,453        3,468                 3,841          3,487
> >  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
> >  SWPOUT-64kB-mTHP           0            0                     0              0
> >  ------------------------------------------------------------------------------
> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> > backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> > Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> > Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
> > ===============================================================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> > 64K/2M (m)THP to be split, and only 4K folios processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                  Baseline                               Baseline
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
> >  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
> >  memcg_high          124,183      127,513     138,651     133,884
> >  memcg_swap_high           0            0           0           0
> >  memcg_swap_fail     619,020      751,099           0           0
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  656          569         624         639
> >  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
> 
> I would expect zswpout to either remain the same or slightly increase when using
> CONFIG_THP_SWAP. But for deflate-iaa, there is a 17% decrease in zswpout, which
> doesn't make sense?

Good question. Without CONFIG_THP_SWAP, we see 751,099 memcg_swap_fail
counts with deflate-iaa. With CONFIG_THP_SWAP, we see 0 memcg_swap_fail
counts with deflate-iaa. My interpretation of this data is that with
CONFIG_THP_SWAP, the main contributing factors to memory.high breaches
are faster swapout causing faster allocations + cgroup zswap charging.
Without CONFIG_THP_SWAP, there seems to be an additional contribution
of pages that remain in memory due to swap slot allocation failures; and
hence more swapouts. Could there also be some effect of the reclaim
path latency overhead of making 16 calls to swap_writepage() per mTHP
that is split, vs. making one call in the case of zswap-mTHP? Would appreciate
other analyses and explanations.
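
The 16 calls per split mTHP follow from 64K / 4K. As a rough cross-check of the tabulated counters, the sketch below assumes a 4 KiB base page and multiplies the per-order ZSWPOUT counts by the number of base pages per folio, then compares the result with the total zswpout in the same column. The names are illustrative only.

```python
BASE_PAGE_KB = 4  # assumed base page size


def pages_per_folio(folio_kb: int) -> int:
    """Number of base pages in a folio of the given size."""
    return folio_kb // BASE_PAGE_KB


# (folio size in KB, per-order ZSWPOUT count, total zswpout), copied from
# the "after" columns of the tables (zswap-mTHP, CONFIG_THP_SWAP=Y).
rows = {
    "64K zstd": (64, 590_768, 9_453_761),
    "64K deflate-iaa": (64, 586_521, 9_385_910),
    "2M zstd": (2048, 16_684, 8_546_895),
    "2M deflate-iaa": (2048, 18_270, 9_355_760),
}

# Pages in zswpout that are not accounted for by whole-mTHP stores.
leftover = {}
for name, (folio_kb, mthp_stores, zswpout) in rows.items():
    implied = mthp_stores * pages_per_folio(folio_kb)
    leftover[name] = zswpout - implied
    print(f"{name:16s} implied pages: {implied:>10,}  leftover: {leftover[name]:,}")
```

In every column the mTHP stores account for nearly all of zswpout, with a remainder of a few thousand pages at most.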

> 
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  pgmajfault            3,470        3,382       4,633       3,611
> >  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
> >                                  Baseline                              Baseline
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
> >  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
> >  memcg_high            14,628     16,247       14,951      16,096
> >  memcg_swap_high            0          0            0           0
> >  memcg_swap_fail       18,698     21,114            0           0
> >  pswpin                     0          0            0           0
> >  pswpout                    0          0            0           0
> >  zswpin                   663        665        5,333         781
> >  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
> >  thp_swpout                 0          0            0           0
> >  thp_swpout_           18,697     21,113            0           0
> >   fallback
> >  pgmajfault             3,439      3,496        8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
> >  SWPOUT-2048kB              0          0            0           0
> >  -----------------------------------------------------------------------------
> >
> > We see improvements overall in throughput and sys time for zstd and
> > deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> >
> >
> > Case 2: Baseline with CONFIG_THP_SWAP enabled.
> > ==============================================
> >
> > In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
> > being stored by the backing swap device.
> >
> 
> 
> Just curious, how did you make the before case of zswap rejecting mTHP work?

I suppose your question is about the experimental setup used for "before"?
If so, the kernel I used was v6.11-rc3 in which zswap rejects mTHP stores,
and mTHP gets processed in __swap_writepage(). For the v6 data, I had
176GiB ZRAM (35% of available RAM) as the backing swap device for ZSWAP.
Hence the mTHPs would be processed by swap_writepage_bdev_sync().
Please let me know if this answers your question.

> 
> > The "after" represents data with this patch-series, that results in 64K/2M
> > (m)THP being processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  ------------------------------------------------------------------------------
> >                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
> >                                Baseline                                Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
> >                                     iaa                     iaa             iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
> >  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
> >  memcg_high          111,223    110,889     138,651     133,884
> >  memcg_swap_high           0          0           0           0
> >  memcg_swap_fail           0          0           0           0
> >  pswpin                   16         16           0           0
> >  pswpout           7,471,472  7,527,963           0           0
> >  zswpin                  635        605         624         639
> >  zswpout               1,509      1,478   9,453,761   9,385,910
> >  thp_swpout                0          0           0           0
> >  thp_swpout_               0          0           0           0
> >   fallback
> >  pgmajfault            3,616      3,430       4,633       3,611
> >  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
> >  SWPOUT-64kB         466,967    470,498           0           0
> >  ------------------------------------------------------------------------------
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  ------------------------------------------------------------------------------
> >                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
> >                                 Baseline                               Baseline
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                      iaa                     iaa            iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
> >  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
> >  memcg_high            16,054     15,936      14,951      16,096
> >  memcg_swap_high            0          0           0           0
> >  memcg_swap_fail            0          0           0           0
> >  pswpin                     0          0           0           0
> >  pswpout            8,629,248  8,628,907           0           0
> >  zswpin                   560        645       5,333         781
> >  zswpout                1,416      1,503   8,546,895   9,355,760
> >  thp_swpout            16,854     16,853           0           0
> >  thp_swpout_                0          0           0           0
> >   fallback
> >  pgmajfault             3,341      3,574       8,139       3,582
> >  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
> >  SWPOUT-2048kB         16,854     16,853           0           0
> >  ------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both the allocations and
> > the zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> 
> Hmm, if that was the case, wouldn't the "after" zswpout be much more than the
> "before" pswpout? But they look very similar (it even goes down for zstd).

For 64k mTHP, the "after" zswpout is considerably more than "before" pswpout,
and so are the memcg_high counts. My comments were based on this
(my apologies: I should have been more specific).

In case of 2M THP, you are right: the "after" zswpout and "before" pswpout
are quite similar.
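
To make the comparison concrete, the arithmetic can be sketched as below. The
counter values are copied from the zstd columns of the Case 2 tables; the
helper pct_change() is hypothetical, and the script assumes 4 KiB base pages
(16 pages per 64kB folio, 512 per 2048kB folio) and that zswpout is counted
in pages.

```python
# Counter values copied from the Case 2 tables (zstd columns).
PAGES_PER_FOLIO = {"64kB": 16, "2048kB": 512}  # assumes 4 KiB base pages

before_pswpout = {"64kB": 7_471_472, "2048kB": 8_629_248}
after_zswpout = {"64kB": 9_453_761, "2048kB": 8_546_895}
after_zswpout_mthp = {"64kB": 590_768, "2048kB": 16_684}


def pct_change(before, after):
    """Relative change of `after` with respect to `before`, in percent."""
    return 100.0 * (after - before) / before


for size in ("64kB", "2048kB"):
    delta = pct_change(before_pswpout[size], after_zswpout[size])
    # Pages accounted for by whole-folio stores, as a share of all zswpout pages.
    mthp_pages = after_zswpout_mthp[size] * PAGES_PER_FOLIO[size]
    share = 100.0 * mthp_pages / after_zswpout[size]
    print(f"{size}: zswpout vs pswpout {delta:+.1f}%, "
          f"mTHP folios cover {share:.1f}% of zswpout pages")
```

With these inputs, the 64k "after" zswpout comes out roughly 26.5% above the
"before" pswpout, while for 2M it is about 1% below. In both cases nearly all
zswpout pages are accounted for by whole-folio stores.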

> 
> If pswpout in before is approximately equal to zswpout in after, then doesn't
> it mean that swap is performing better than zswap? That probably shouldn't
> happen.

Agreed. Based on comments from Yosry and Nhat, I have posted 64k mTHP
data with 4G SSD backing zswap, instead of 175G ZRAM backing zswap. If we agree
to continue using 4G SSD as the backing device, I can gather data with 2M THP
as well for further analysis of this patchset.

Thanks,
Kanchana

> 
> Thanks,
> Usama
> 
> > This causes degradation in throughput and sys time with zswap mTHP, more so
> > in case of zstd than deflate-iaa. Compress latency could play a part in
> > this - when there is more swapout activity happening, a slower compressor
> > would cause allocations to stall for any/all of the 70 processes.
> >
> > In my opinion, even though the test setup does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
> >
> > One of the ideas that has shown promise in our experiments is to improve
> > ZSWAP mTHP store performance using batching. With IAA compress/decompress
> > batching used in ZSWAP, we are able to demonstrate significant
> > performance improvements and memory savings with IAA in scalability
> > experiments, as compared to software compressors. We hope to submit
> > this work as subsequent RFCs.
> >
> > I would greatly appreciate your code review comments and suggestions!
> >
> > Thanks,
> > Kanchana
> >
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> >
> >
> > Kanchana P Sridhar (3):
> >   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> >   mm: zswap: zswap_store() extended to handle mTHP folios.
> >   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> >     stats.
> >
> >  include/linux/huge_mm.h    |   1 +
> >  include/linux/memcontrol.h |   4 +
> >  mm/Kconfig                 |   8 ++
> >  mm/huge_memory.c           |   3 +
> >  mm/page_io.c               |   3 +-
> >  mm/zswap.c                 | 243 +++++++++++++++++++++++++++----------
> >  6 files changed, 200 insertions(+), 62 deletions(-)
> >
> >
> > base-commit: 9287e4adbc6ab8fa04d25eb82e097fed877a4642



Thread overview: 34+ messages
2024-08-29 21:27 [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
2024-08-29 21:27 ` [PATCH v6 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-08-29 21:27 ` [PATCH v6 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
2024-08-29 23:06   ` Yosry Ahmed
2024-09-20  1:57     ` Sridhar, Kanchana P
2024-09-02 11:37   ` Chengming Zhou
2024-09-20  2:43     ` Sridhar, Kanchana P
2024-09-16  5:55   ` Barry Song
2024-09-20 20:53     ` Sridhar, Kanchana P
2024-08-29 21:27 ` [PATCH v6 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
2024-08-30  0:19   ` Nhat Pham
2024-09-20  2:32     ` Sridhar, Kanchana P
2024-09-20 22:57   ` Yosry Ahmed
2024-09-20 23:28     ` Sridhar, Kanchana P
2024-08-29 22:48 ` [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
2024-08-29 23:45   ` Nhat Pham
2024-08-29 23:54     ` Yosry Ahmed
2024-08-30  0:06       ` Nhat Pham
2024-08-30  0:14         ` Yosry Ahmed
2024-09-20  2:30           ` Sridhar, Kanchana P
2024-09-20  2:26         ` Sridhar, Kanchana P
2024-09-20  2:22       ` Sridhar, Kanchana P
2024-09-20  2:16     ` Sridhar, Kanchana P
2024-09-20  9:12       ` Huang, Ying
2024-09-20 16:53         ` Sridhar, Kanchana P
2024-08-30  9:27   ` Huang, Ying
2024-09-20  2:41     ` Sridhar, Kanchana P
2024-09-20  1:41   ` Sridhar, Kanchana P
2024-09-20  9:29     ` Huang, Ying
2024-09-20 17:57       ` Sridhar, Kanchana P
2024-09-20 23:15     ` Yosry Ahmed
2024-09-20 23:45       ` Sridhar, Kanchana P
2024-09-02 14:40 ` Usama Arif
2024-09-20 19:31   ` Sridhar, Kanchana P
