[RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhao <yuzhao@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
	Nhat Pham <nphamcs@gmail.com>, Yuanchu Xie <yuanchu@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	"T . J . Mercier" <tjmercier@google.com>,
	linux-kernel@vger.kernel.orng, Kairui Song <kasong@tencent.com>
Subject: [RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model
Date: Wed, 13 Sep 2023 02:45:07 +0800	[thread overview]
Message-ID: <20230912184511.49333-2-ryncsn@gmail.com> (raw)
In-Reply-To: <20230912184511.49333-1-ryncsn@gmail.com>

From: Kairui Song <kasong@tencent.com>

This basically removed workingset_activation and reduced calls to
workingset_age_nonresident.

The idea behind this change is a new way to calculate the refault
distance and prepare for adapting refault distance based re-activation
for multi-gen LRU.

Currently, refault distance re-activation is based on two assumptions:
1. Activation of an inactive page will left-shift LRU pages (considering
   LRU starts from right).
2. Eviction of an inactive page will left-shift LRU pages.

Assumption 2 is correct, but assumption 1 is not always true, an activated
page could be anywhere in the LRU list (through mark_page_accessed), it
only left-shift the pages on its right.

And besides, one page can get activate/deactivated for multiple times.

And multi-gen LRU doesn't fit with this model well, pages are getting
aged and activated constantly as the generation sliding window slides.

So instead we introduce a simpler idea here: Just presume the evicted
pages are still in memory, each has an eviction sequence like before.
Let the `nonresistence_age` still be NA and get increased for each
eviction, so we get a "Shadow LRU" here of one evicted page:

  Let SP = ((NA's reading @ current) - (NA's reading @ eviction))

                           +-memory available to cache-+
                           |                           |
 +-------------------------+===============+===========+
 | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
 +-+-----------------------+===============+===========+
   |                       |
   +-----------------------+
   |         SP
 fault page          O -> Hole left by previously faulted in pages
                     * -> The page corresponding to SP

It can be easily seen that SP stands for how far the current workflow
could push a page out of available memory. Since all evicted page was
once head of INACTIVE list, the page could have such an access distance:

  SP + NR_INACTIVE

It *may* get re-activated before getting evicted again if:

  SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

Which can be simplified to:

  SP < NR_ACTIVE

Then the page is worth getting re-activated to start from ACTIVE part,
since the access distance is shorter than the total memory to make it
stay. The calculation is same as before, just dropped the assumption 1
above.

And since this is only an estimation, based on several hypotheses, and
it could break the ability of LRU to distinguish a workingset out of
caches, so throttle this by two factors:

1. Notice previously re-faulted in pages may leave "holes" on the shadow
   part of LRU, that part is left unhandled on purpose to decrease
   re-activate rate for pages that have a large SP value (the larger
   SP value a page has, the more likely it will be affected by such
   holes).
2. When the ACTIVE part of LRU is long enough, chanllaging ACTIVE pages
   by re-activating a one-time faulted previously INACTIVE page may not
   be a good idea, so throttle the re-activation when ACTIVE > INACTIVE
   by comparing with INACTIVE instead.

Combined all above, we have:
Upon refault, if any of following conditions is met, mark page as active:

- If ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
  SP < NR_ACTIVE

- If ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
  SP < NR_INACTIVE

The code is almose same but simpler than before, since no longer need to
do lruvec statistic update when activating a page. A few benchmarks
showed a similar or better result. And when combined with multi-gen
LRU (in later commits) it shows a measurable performance gain for some
workloads.

Using memtier and fio test from commit ac35a4902374 but scaled down
to fit in my test environment, and some other test results:

  memtier test (with 16G ramdisk as swap and 2G memcg limit on an i7-9700):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0700 -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

  fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):
  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

  fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):
  fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:1.2 --norandommap \
    --time_based --ramp_time=10m --runtime=5m --group_reporting

  mysql (using oltp_read_only from sysbench, with 12G of buffer pool
  in a 10G memcg):
  sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
    --mysql-db=sb --tables=36 --table-size=2000000 --threads=12 --time=1800

Before (Average of 6 test run):
fio: IOPS=5213.7k
fio2: IOPS=7315.3k
memcached: 49493.75 ops/s
mysql: 6237.45 tps

After (Average of 6 test run):
fio: IOPS=5230.5k
fio2: IOPS=7349.3k
memcached: 49912.79 ops/s
mysql: 6240.62 tps

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   2 -
 mm/swap.c            |   1 -
 mm/vmscan.c          |   2 -
 mm/workingset.c      | 215 +++++++++++++++++++++----------------------
 4 files changed, 106 insertions(+), 114 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 493487ed7c38..ca51d79842b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,10 +344,8 @@ static inline swp_entry_t page_swap_entry(struct page *page)
 
 /* linux/mm/workingset.c */
 bool workingset_test_recent(void *shadow, bool file, bool *workingset);
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);
-void workingset_activation(struct folio *folio);
 
 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..685b446fd4f9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,7 +482,6 @@ void folio_mark_accessed(struct folio *folio)
 		else
 			__lru_cache_activate_folio(folio);
 		folio_clear_referenced(folio);
-		workingset_activation(folio);
 	}
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..3f4de75e5186 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,8 +2539,6 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
-		if (folio_test_active(folio))
-			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..babda11601ea 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,9 +180,10 @@
  */
 
 #define WORKINGSET_SHIFT 1
-#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
+#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
 			 WORKINGSET_SHIFT + NODES_SHIFT + \
 			 MEM_CGROUP_ID_SHIFT)
+#define EVICTION_BITS	(BITS_PER_LONG - (EVICTION_SHIFT))
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -226,8 +227,103 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-#ifdef CONFIG_LRU_GEN
+/*
+ * Get the distance reading at eviction time.
+ */
+static inline unsigned long lru_eviction(struct lruvec *lruvec,
+					 int bits, int bucket_order)
+{
+	unsigned long eviction = atomic_long_read(&lruvec->nonresident_age);
+
+	eviction >>= bucket_order;
+	eviction &= ~0UL >> (BITS_PER_LONG - bits);
+
+	return eviction;
+}
+
+/*
+ * Calculate and test refault distance
+ */
+static inline bool lru_refault(struct mem_cgroup *memcg,
+			       struct lruvec *lruvec,
+			       unsigned long eviction, bool file,
+			       int bits, int bucket_order)
+{
+	unsigned long refault, distance;
+	unsigned long workingset, active, inactive, inactive_file, inactive_anon = 0;
+
+	eviction <<= bucket_order;
+	refault = atomic_long_read(&lruvec->nonresident_age);
+
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across nonresident_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
+	 */
+	distance = (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
 
+	active = lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+	inactive_file = lruvec_page_state(lruvec, NR_INACTIVE_FILE);
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+		active += lruvec_page_state(lruvec, NR_ACTIVE_ANON);
+		inactive_anon = lruvec_page_state(lruvec, NR_INACTIVE_ANON);
+	}
+
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't activate pages that couldn't stay resident even if
+	 * all the memory was available to the workingset. Whether
+	 * workingset competition needs to consider anon or not depends
+	 * on having free swap space.
+	 *
+	 * When there are already enough active pages, be less aggressive
+	 * on reactivating pages, challenge an already established set of
+	 * active pages with one time refaulted page may not be a good idea.
+	 */
+	if (active >= (inactive_anon + inactive_file))
+		return distance < inactive_anon + inactive_file;
+	else
+		return distance < active + (file ? inactive_anon : inactive_file);
+}
+
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+static void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
+{
+	/*
+	 * Reclaiming a cgroup means reclaiming all its children in a
+	 * round-robin fashion. That means that each cgroup has an LRU
+	 * order that is composed of the LRU orders of its child
+	 * cgroups; and every page has an LRU position not just in the
+	 * cgroup that owns it, but in all of that group's ancestors.
+	 *
+	 * So when the physical inactive list of a leaf cgroup ages,
+	 * the virtual inactive lists of all its parents, including
+	 * the root cgroup's, age as well.
+	 */
+	do {
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
+}
+
+#ifdef CONFIG_LRU_GEN
 static void *lru_gen_eviction(struct folio *folio)
 {
 	int hist;
@@ -342,34 +438,6 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 #endif /* CONFIG_LRU_GEN */
 
-/**
- * workingset_age_nonresident - age non-resident entries as LRU ages
- * @lruvec: the lruvec that was aged
- * @nr_pages: the number of pages to count
- *
- * As in-memory pages are aged, non-resident pages need to be aged as
- * well, in order for the refault distances later on to be comparable
- * to the in-memory dimensions. This function allows reclaim and LRU
- * operations to drive the non-resident aging along in parallel.
- */
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
-{
-	/*
-	 * Reclaiming a cgroup means reclaiming all its children in a
-	 * round-robin fashion. That means that each cgroup has an LRU
-	 * order that is composed of the LRU orders of its child
-	 * cgroups; and every page has an LRU position not just in the
-	 * cgroup that owns it, but in all of that group's ancestors.
-	 *
-	 * So when the physical inactive list of a leaf cgroup ages,
-	 * the virtual inactive lists of all its parents, including
-	 * the root cgroup's, age as well.
-	 */
-	do {
-		atomic_long_add(nr_pages, &lruvec->nonresident_age);
-	} while ((lruvec = parent_lruvec(lruvec)));
-}
-
 /**
  * workingset_eviction - note the eviction of a folio from memory
  * @target_memcg: the cgroup that is causing the reclaim
@@ -396,11 +464,11 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->nonresident_age);
-	eviction >>= bucket_order;
+
+	eviction = lru_eviction(lruvec, EVICTION_BITS, bucket_order);
 	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 	return pack_shadow(memcgid, pgdat, eviction,
-				folio_test_workingset(folio));
+			   folio_test_workingset(folio));
 }
 
 /**
@@ -418,9 +486,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
-	unsigned long refault_distance;
-	unsigned long workingset_size;
-	unsigned long refault;
 	int memcgid;
 	struct pglist_data *pgdat;
 	unsigned long eviction;
@@ -429,7 +494,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
-	eviction <<= bucket_order;
 
 	/*
 	 * Look up the memcg associated with the stored ID. It might
@@ -450,50 +514,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
-
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
-
-	/*
-	 * Calculate the refault distance
-	 *
-	 * The unsigned subtraction here gives an accurate distance
-	 * across nonresident_age overflows in most cases. There is a
-	 * special case: usually, shadow entries have a short lifetime
-	 * and are either refaulted or reclaimed along with the inode
-	 * before they get too old.  But it is not impossible for the
-	 * nonresident_age to lap a shadow entry in the field, which
-	 * can then result in a false small refault distance, leading
-	 * to a false activation should this old entry actually
-	 * refault again.  However, earlier kernels used to deactivate
-	 * unconditionally with *every* reclaim invocation for the
-	 * longest time, so the occasional inappropriate activation
-	 * leading to pressure on the active list is not a problem.
-	 */
-	refault_distance = (refault - eviction) & EVICTION_MASK;
 
-	/*
-	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the workingset. Whether
-	 * workingset competition needs to consider anon or not depends
-	 * on having free swap space.
-	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
-	if (!file) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_FILE);
-	}
-	if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
-		if (file) {
-			workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_ANON);
-		}
-	}
-
-	return refault_distance <= workingset_size;
+	return lru_refault(eviction_memcg, eviction_lruvec, eviction, file,
+			   EVICTION_BITS, bucket_order);
 }
 
 /**
@@ -543,7 +567,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 		goto out;
 
 	folio_set_active(folio);
-	workingset_age_nonresident(lruvec, nr);
 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
 
 	/* Folio was active prior to eviction */
@@ -560,30 +583,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	rcu_read_unlock();
 }
 
-/**
- * workingset_activation - note a page activation
- * @folio: Folio that is being activated.
- */
-void workingset_activation(struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-
-	rcu_read_lock();
-	/*
-	 * Filter non-memcg pages here, e.g. unmap can call
-	 * mark_page_accessed() on VDSO pages.
-	 *
-	 * XXX: See workingset_refault() - this should return
-	 * root_mem_cgroup even for !CONFIG_MEMCG.
-	 */
-	memcg = folio_memcg_rcu(folio);
-	if (!mem_cgroup_disabled() && !memcg)
-		goto out;
-	workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
-out:
-	rcu_read_unlock();
-}
-
 /*
  * Shadow entries reflect the share of the working set that does not
  * fit into memory, so their number depends on the access pattern of
@@ -778,7 +777,6 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
-	unsigned int timestamp_bits;
 	unsigned int max_order;
 	int ret;
 
@@ -790,12 +788,11 @@ static int __init workingset_init(void)
 	 * some more pages at runtime, so keep working with up to
 	 * double the initial memory by using totalram_pages as-is.
 	 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
 	max_order = fls_long(totalram_pages() - 1);
-	if (max_order > timestamp_bits)
-		bucket_order = max_order - timestamp_bits;
+	if (max_order > EVICTION_BITS)
+		bucket_order = max_order - EVICTION_BITS;
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
-	       timestamp_bits, max_order, bucket_order);
+	       EVICTION_BITS, max_order, bucket_order);
 
 	ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
 	if (ret)
-- 
2.41.0

next prev parent reply	other threads:[~2023-09-12 18:45 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-12 18:45 [RFC PATCH v2 0/5] Refault distance checking for MGLRU Kairui Song
2023-09-12 18:45 ` Kairui Song [this message]
2023-09-12 19:47   ` [RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model Johannes Weiner
2023-09-13  9:26     ` Kairui Song
2023-09-12 21:05   ` kernel test robot
2023-09-12 18:45 ` [RFC PATCH v2 2/5] workingset: update comment in workingset.c Kairui Song
2023-09-12 18:45 ` [RFC PATCH v2 3/5] workingset: simplify lru_gen_test_recent Kairui Song
2023-09-12 21:28   ` kernel test robot
2023-09-13 12:31   ` kernel test robot
2023-09-12 18:45 ` [RFC PATCH v2 4/5] lru_gen: convert avg_total and avg_refaulted to atomic Kairui Song
2023-09-12 18:45 ` [RFC PATCH v2 5/5] workingset, lru_gen: apply refault-distance based re-activation Kairui Song
2023-09-12 21:49   ` kernel test robot

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:493487ed7c3 dfblob:ca51d79842b dfblob:cd8f0150ba3
dfblob:685b446fd4f dfblob:6f13394b112 dfblob:3f4de75e518
dfblob:da58a26d0d4 dfblob:babda11601e )
 OR (
bs:"[RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230912184511.49333-2-ryncsn@gmail.com \
    --to=ryncsn@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.orng \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=nphamcs@gmail.com \
    --cc=roman.gushchin@linux.dev \
    --cc=surenb@google.com \
    --cc=tjmercier@google.com \
    --cc=yuanchu@google.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.